Hello.
For the last N systems I've installed, I have usually been assembling software RAID1 at install time. The disk layout I'm using is trivial: 100MB + the rest, where the first 100MB is /dev/md0 (ext3, /boot) and the rest is /dev/md1 (LVM, everything else).
For whatever reason (I haven't looked at the code yet :() anaconda issues the mdadm commands in such an order that the primary replica of the unsynced md0 is the second drive, and the RAID1 resync starts on md1 first.
So, to boot without changing BIOS settings back and forth (or without disabling the first disk drive by some other means), it is necessary to wait until the md1 RAID1 resync completes, which can take as long as 2 hours (250GB HDD).
I've not tried FC5 test1 yet, but I'm pretty sure that without additional steps to resolve this issue it won't just vaporize. So, is it possible to change anaconda with respect to the described issue?
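For reference, the layout described above can be sketched roughly like this (device names and command order are assumptions for illustration; these mirror what anaconda sets up, not its actual invocations):

```shell
# Assumed partitioning: sda1/sdb1 (~100MB each) and sda2/sdb2 (the rest).
# /boot mirror:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# everything-else mirror, used as an LVM physical volume:
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

mkfs.ext3 /dev/md0      # /boot
pvcreate /dev/md1       # LVM PV for the rest
vgcreate vg0 /dev/md1
```

The complaint in the message is about which array md starts resyncing first after the two --create calls, and which drive md treats as the authoritative copy while the arrays are still out of sync.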
On Mon, 2005-12-19 at 23:16 +0300, Paul P Komkoff Jr wrote:
Hello.
For the last N systems I've installed, I have usually been assembling software RAID1 at install time. The disk layout I'm using is trivial: 100MB + the rest, where the first 100MB is /dev/md0 (ext3, /boot) and the rest is /dev/md1 (LVM, everything else).
For whatever reason (I haven't looked at the code yet :() anaconda issues the mdadm commands in such an order that the primary replica of the unsynced md0 is the second drive, and the RAID1 resync starts on md1 first.
So, to boot without changing BIOS settings back and forth (or without disabling the first disk drive by some other means), it is necessary to wait until the md1 RAID1 resync completes, which can take as long as 2 hours (250GB HDD).
I'm pretty sure that anaconda at some point deliberately did things to avoid that sync. So I suspect it got reenabled for a good reason...
Replying to Arjan van de Ven:
So, to boot without changing BIOS settings back and forth (or without disabling the first disk drive by some other means), it is necessary to wait until the md1 RAID1 resync completes, which can take as long as 2 hours (250GB HDD).
I'm pretty sure that anaconda at some point deliberately did things to avoid that sync. So I suspect it got reenabled for a good reason...
There's nothing wrong with the sync itself, only with the order. If /boot is synced first, then it is safe to reboot as soon as the actual installation is complete (and root volume reconstruction can happen later).
On Tue, 2005-12-20 at 11:56 +0300, Paul P Komkoff Jr wrote:
Replying to Arjan van de Ven:
So, to boot without changing BIOS settings back and forth (or without disabling the first disk drive by some other means), it is necessary to wait until the md1 RAID1 resync completes, which can take as long as 2 hours (250GB HDD).
I'm pretty sure that anaconda at some point deliberately did things to avoid that sync. So I suspect it got reenabled for a good reason...
There's nothing wrong with the sync itself, only with the order. If /boot is synced first, then it is safe to reboot as soon as the actual installation is complete (and root volume reconstruction can happen later).
well... if the root is brand spanking new.. why would it need a sync?
On Tue, 2005-12-20 at 11:13 +0100, Ralf Ertzinger wrote:
On Tue, Dec 20, 2005 at 10:46:18AM +0100, Arjan van de Ven wrote:
well... if the root is brand spanking new.. why would it need a sync?
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
Replying to Arjan van de Ven:
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
RAID1 means that the contents of both mirrors are exactly the same. Just after creation, this is obviously not true. The Linux MD implementation solves this problem by syncing them.
Another solution might be to store a list of written blocks and work from that, but I don't think that would be reliable/scalable.
Anyway, as I said previously, root volume reconstruction can happily continue after firstboot (and the second boot too), but inconsistency in /boot makes the system unbootable.
Paul P Komkoff Jr wrote:
Replying to Arjan van de Ven:
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
RAID1 means that the contents of both mirrors are exactly the same. Just after creation, this is obviously not true. The Linux MD implementation solves this problem by syncing them.
No, but if you plan to format the partition right after creating the RAID, which is usually the case, you can force the RAID clean: sectors will only be written (and written identically to both disks), never read first. Well, at least on paper.
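Concretely, "forcing the RAID clean" at creation time is what mdadm's --assume-clean flag does. A minimal sketch (device names assumed; as noted later in the thread, skipping the resync like this is only safe when every sector will be written before it is ever read, i.e. you mkfs the array immediately):

```shell
# Create the mirror but skip the initial resync entirely.
# Safe only if the array is formatted right away, so no sector
# is read before both halves of the mirror have been written.
mdadm --create /dev/md1 --level=1 --raid-devices=2 \
      --assume-clean /dev/sda2 /dev/sdb2
mkfs.ext3 /dev/md1
```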
On Tue, 2005-12-20 at 11:19 +0100, Arjan van de Ven wrote:
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
Because it's RAID, and it doesn't know anything about the file system -- it's just trying naïvely to keep the _block_ device synchronised, without any of that helpful knowledge about which blocks the file system actually cares about, or might have touched since the array was last known to be consistent.
This is one of the reasons I believe RAID is the wrong solution; a better option would be to have the file system natively handle redundancy across block devices, rather than having two separate layers each of which know nothing about the other.
Imagine a 'RAID resync' which only needed to resync those blocks which the file system's journal says have been modified since the state was last consistent...
On Tue, 20 December 2005 13:35, David Woodhouse wrote:
On Tue, 2005-12-20 at 11:19 +0100, Arjan van de Ven wrote:
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
Because it's RAID, and it doesn't know anything about the file system -- it's just trying naïvely to keep the _block_ device synchronised, without any of that helpful knowledge about which blocks the file system actually cares about, or might have touched since the array was last known to be consistent.
I don't know how anaconda handles it, but the default RAID I/O limits are very conservative (optimized for background work on a live system).
On a system being created it's probably a better idea to tell the RAID system to pump as many blocks as possible, since this is a blocking operation in this context and we don't care about anything else.
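The limits in question are md's global resync throttles under /proc/sys/dev/raid/. A sketch of raising them for an install-time resync (the "typical" default values are an assumption from kernels of that era; check your own /proc before relying on them):

```shell
# Per-device resync bandwidth limits, in KB/s.
cat /proc/sys/dev/raid/speed_limit_min   # typically 1000
cat /proc/sys/dev/raid/speed_limit_max   # typically 200000

# During an install nothing else competes for the disks, so the
# guaranteed minimum can be raised to let the resync run flat out.
echo 50000 > /proc/sys/dev/raid/speed_limit_min
```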
On Tue, 2005-12-20 at 14:06 +0100, Nicolas Mailhot wrote:
I don't know how anaconda handles it, but the default RAID I/O limits are very conservative (optimized for background work on a live system).
On a system being created it's probably a better idea to tell the RAID system to pump as many blocks as possible, since this is a blocking operation in this context and we don't care about anything else.
I think you miss the point. On a file system which is being created, the bandwidth limits _ought_ to be pretty much irrelevant. The file system is _empty_, and the total amount of I/O needed to create an empty RAID-1 ought to be almost exactly what it takes just to run 'mkfs' twice. The blocks which aren't yet used by the filesystem don't matter; you shouldn't need to sync their contents.
RAID is just a hack which hides the redundancy from the file system; the kind of thing we always used to have to do for DOS compatibility so that we could provide an INT 13h handler and have it 'just work'. It's much the same reasoning which led to the built-in 'translation layers' on cheap CF cards and USB sticks which so often eat your data and which have a tendency to use up their limited lifetime in copying obsolete sectors belonging to deleted files around on the underlying medium.
We don't _need_ that kind of layering in Linux. Have a library of functions which can be used for such redundancy, by all means -- but don't just make RAID pretend to be a 'standard' block device and prevent all possibility of sharing knowledge between the file system and the layer providing the redundancy. That's just insane.
Replying to David Woodhouse:
RAID is just a hack which hides the redundancy from the file system; the kind of thing we always used to have to do for DOS compatibility so that we could provide an INT 13h handler and have it 'just work'. It's much the same reasoning which led to the built-in 'translation layers' on cheap CF cards and USB sticks which so often eat your data and which have a tendency to use up their limited lifetime in copying obsolete sectors belonging to deleted files around on the underlying medium.
We don't _need_ that kind of layering in Linux. Have a library of functions which can be used for such redundancy, by all means -- but don't just make RAID pretend to be a 'standard' block device and prevent all possibility of sharing knowledge between the file system and the layer providing the redundancy. That's just insane.
Well, everything you said above is correct. But I have yet to see any "redundant block device filesystem" project anywhere. Anybody willing to start one?
Meanwhile, maybe we can fix the md sync ordering? :)))
Paul P Komkoff Jr wrote:
Replying to David Woodhouse:
RAID is just a hack which hides the redundancy from the file system; the kind of thing we always used to have to do for DOS compatibility so that we could provide an INT 13h handler and have it 'just work'. It's much the same reasoning which led to the built-in 'translation layers' on cheap CF cards and USB sticks which so often eat your data and which have a tendency to use up their limited lifetime in copying obsolete sectors belonging to deleted files around on the underlying medium.
We don't _need_ that kind of layering in Linux. Have a library of functions which can be used for such redundancy, by all means -- but don't just make RAID pretend to be a 'standard' block device and prevent all possibility of sharing knowledge between the file system and the layer providing the redundancy. That's just insane.
Well, everything you said above is correct. But I have yet to see any "redundant block device filesystem" project anywhere.
Sun zfs, Isilon OneFS, possibly others.
On Tue, 20 December 2005 14:20, David Woodhouse wrote:
On Tue, 2005-12-20 at 14:06 +0100, Nicolas Mailhot wrote:
I don't know how anaconda handles it but the default raid IO limits are very conservative (optimized for background work on a live system)
On a system being created it's probably a better idea to tell the RAID system to pump as many blocks as possible, since this is a blocking operation in this context and we don't care about anything else.
I think you miss the point. On a file system which is being created, the bandwidth limits _ought_ to be pretty much irrelevant.
No, they're not. If I remember correctly, during an upgrade anaconda can transform some existing logical volumes into redundant volumes, so you can have a fairly big mass of data to transfer.
And the bandwidth limits are *very* conservative. The gain isn't just 5%, or even a factor of two; sometimes it's a 10x (or more) difference.
On Tue, Dec 20, 2005 at 12:35:17PM +0000, David Woodhouse wrote:
Imagine a 'RAID resync' which only needed to resync those blocks which the file system's journal says have been modified since the state was last consistent...
The fact that Linux md doesn't currently deal with that isn't about the fs/md split; it's that nobody has written it. High-end RAID implementations tend to keep dirty bitmaps and journal them, so that the consistency of any stripe can be established.
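As it happens, a dirty-bitmap mechanism did land in md as "write-intent bitmaps" (mdadm 2.0 / kernel 2.6.13 era). A sketch of using them (device names assumed):

```shell
# With a write-intent bitmap, md records which regions are dirty;
# after an unclean shutdown only the flagged regions are resynced,
# not the whole array.

# Add an internal bitmap to an existing array:
mdadm --grow /dev/md0 --bitmap=internal

# Or create a new array with one from the start:
mdadm --create /dev/md1 --level=1 --raid-devices=2 \
      --bitmap=internal /dev/sda2 /dev/sdb2
```

Note this addresses crash recovery, not the initial post-creation sync the thread started with.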
On Thu, 2005-12-22 at 06:25 -0500, Alan Cox wrote:
The fact that Linux md doesn't currently deal with that isn't about the fs/md split; it's that nobody has written it. High-end RAID implementations tend to keep dirty bitmaps and journal them, so that the consistency of any stripe can be established.
The same information is already available in the file system's own data structures; it shouldn't be necessary for the RAID layer to do that for itself. _That_ kind of redundancy isn't what RAID is supposed to be about.
But yes, there are always band-aids which can help to improve any design flaw.
On Thu, 22-12-2005 at 11:55 +0000, David Woodhouse wrote:
On Thu, 2005-12-22 at 06:25 -0500, Alan Cox wrote:
The fact that Linux md doesn't currently deal with that isn't about the fs/md split; it's that nobody has written it. High-end RAID implementations tend to keep dirty bitmaps and journal them, so that the consistency of any stripe can be established.
The same information is already available in the file system's own data structures; it shouldn't be necessary for the RAID layer to do that for itself. _That_ kind of redundancy isn't what RAID is supposed to be about.
What if I use the RAID device without any filesystem on it?
On Thu, 22 Dec 2005, David Woodhouse wrote:
The same information is already available in the file system's own data structures; it shouldn't be necessary for the RAID layer to do that for itself. _That_ kind of redundancy isn't what RAID is supposed to be about.
But yes, there are always band-aids which can help to improve any design flaw.
I rather like the separation between file systems and RAID devices. It gives more choice and makes both components simpler. Now I can choose raid1+ext2, raid5+ext3, swap+raid0, and reiser+device.
I would rather not lose those options. Otherwise it would be like combining "uniq" and "sort" just because you could make one case work better.
christof
On Thu, Dec 22, 2005 at 11:55:44AM +0000, David Woodhouse wrote:
structures; it shouldn't be necessary for the RAID layer to do that for itself. _That_ kind of redundancy isn't what RAID is supposed to be about.
It's a necessary part of RAID, because duplicating RAID functionality in every file system would be incredibly inefficient and lead to a lot of code duplication and bugs. It's also necessary because the RAID functionality may not even be on the same host as the file system.
But yes, there are always band-aids which can help to improve any design flaw.
You need to learn the difference between a design flaw and pragmatic design.
"Perfection is the enemy of success"
Alan Cox wrote:
It's a necessary part of RAID, because duplicating RAID functionality in every file system would be incredibly inefficient and lead to a lot of code duplication and bugs. It's also necessary because the RAID functionality may not even be on the same host as the file system.
Seen the RAID-Z system in Solaris 10?
http://blogs.sun.com/roller/page/bonwick?entry=raid_z
Or the Write Anywhere File Layout (WAFL) used by NetApp?
http://www.netapp.com/library/tr/3002.pdf
It's clear that you can get better performance and reliability by co-designing a filesystem with the block layer. I think of the difference between old-school Unix (KISS) and System/370: do you build a system that's simple and clean, or do you take the extra effort to maximize performance? Sun can afford to do this because it isn't wasting energy on maintaining 30 filesystems that almost work, but rather focusing on one that does.
On Thu, 2005-12-22 at 09:33 -0500, Alan Cox wrote:
It's a necessary part of RAID, because duplicating RAID functionality in every file system would be incredibly inefficient and lead to a lot of code duplication and bugs. It's also necessary because the RAID functionality may not even be on the same host as the file system.
A bit like the way that duplicating ATA functionality in every SATA host driver also leads to a lot of code duplication and bugs? Or the way that duplicating zlib in everything that needs compression leads to a lot of code duplication and bugs?
The former doesn't happen; the latter was a bug which got fixed.
On Tue, 2005-12-20 at 11:13 +0100, Ralf Ertzinger wrote:
On Tue, Dec 20, 2005 at 10:46:18AM +0100, Arjan van de Ven wrote:
well... if the root is brand spanking new.. why would it need a sync?
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
Why does that matter? The parts of it we've written data out to are supposed to be in sync already. The resync should only be making the parts that aren't in use yet identical.
Peter Jones wrote:
On Tue, 2005-12-20 at 11:13 +0100, Ralf Ertzinger wrote:
On Tue, Dec 20, 2005 at 10:46:18AM +0100, Arjan van de Ven wrote:
well... if the root is brand spanking new.. why would it need a sync?
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
Why does that matter? The parts of it we've written data out to are supposed to be in sync already. The resync should only be making the parts that aren't in use yet identical.
Exactly; it's like trying to sync uninitialized memory. Now, I believe the --assume-clean option of mdadm only works for RAID1, unfortunately.
Also, the layered approach of RAID is why RAID "just works"; trying to build the redundancy into the filesystem itself would be pure madness. There are many excellent OS design books that will explain this much better than I ever could.
On Tue, 2005-12-20 at 10:17 -0800, Denis Leroy wrote:
Also, the layered approach of RAID is why RAID "just works", trying to build the redundancy into the filesystem itself would be pure madness. There are many excellent OS design books that will explain this much better than i ever could.
RAID 'just works' in much the same way that microkernels and the OSI networking model 'work' -- suboptimally. It's a symptom of a popular design methodology which, if allowed free rein, would make Modula-3 programmers of us all.
I'm sure there are many OS design books which Linus could have read that would have explained why microkernels were better, too. Personally, I'm happy that he wasn't persuaded.
The next file system I write will probably do this correctly, but it'll be another flash file system, so won't be relevant to block devices.