I have three disks in my system, divided into a mix of RAID-1 and RAID-5 partitions. I have /boot as a RAID-1 on MD0 (SDA1, SDB1), / as RAID-1 on MD1 (SDA2, SDB2), and the remainder as RAID-5 with LVM (SDA3, SDB3, SDC3). The intent is that for boot and root, if SDA fails, I can boot off SDB.
Every week I get an "Anacron job 'cron.weekly'" e-mail reporting: "WARNING: mismatch_cnt is not 0 on /dev/md1". If I look at /sys/block/md1/md/mismatch_cnt, I see a number, usually 128, sometimes 64. If I manually invoke a scan, I see the same number. Note that neither MD0 nor MD3 reports any errors.
If I mark SDB1 as bad, then remove and re-add it, it rebuilds fine, and a scan shows no errors. Scanning the messages (current and historical), I see no medium errors reported for any of SD[ABC].
I have several questions:
1) Where can the errors be coming from? I would understand if a drive were reporting errors. Could it be that during boot, one of the R-1 members is being written to before MD is started?
2) Without drive error messages, how do I determine which drive is out of sync? I have resynced SDA2 to SDB2, but that is basically a coin flip; if SDB2 were correct, then I may have damaged files on SDA2. How can I determine where the mismatches occur, and then determine the file(s) potentially affected?
Thanks,
--rick
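For reference, the counter described above can be read for every array at once. This is a minimal sketch, not part of the original message: the function name report_mismatch is made up for illustration, and it assumes the usual sysfs layout /sys/block/mdN/md/mismatch_cnt.

```shell
#!/bin/sh
# report_mismatch [SYSFS_ROOT]
# Print "<array> <mismatch_cnt>" for every md array found under SYSFS_ROOT
# (default /sys/block). The function name is hypothetical; only the
# mismatch_cnt path comes from the thread.
report_mismatch() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/mismatch_cnt; do
        # Skip the unexpanded pattern when no md arrays exist.
        [ -r "$f" ] || continue
        dev=${f#"$root"/}   # strip the leading root directory
        dev=${dev%%/*}      # keep only the array name, e.g. md1
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

Calling report_mismatch with no argument reads the live system; passing a directory lets you try it against a copy of the sysfs tree.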
Rick Wagner on 12/28/2009 09:41 PM wrote:
- Where can the errors be coming from? I would understand if a drive were reporting errors. Could it be that during boot, one of the R-1 members is being written to before MD is started?
If you have X running, you can start palimpset and view SMART data. If you don't have X running, you can run "devkit-disks --dump" to view SMART data. Look for "reallocated sectors", "uncorrectable sectors", or "pending sectors", and if they are not showing 0 for current value, then you have bad sectors forming.
On 12/29/2009 09:27 AM, Michael Cronenworth wrote:
Rick Wagner on 12/28/2009 09:41 PM wrote:
- Where can the errors be coming from? I would understand if a drive were reporting errors. Could it be that during boot, one of the R-1 members is being written to before MD is started?
If you have X running, you can start palimpset and view SMART data. If you don't have X running, you can run "devkit-disks --dump" to view SMART data. Look for "reallocated sectors", "uncorrectable sectors", or "pending sectors", and if they are not showing 0 for current value, then you have bad sectors forming.
The OP already said he didn't have any drive errors...
I also have the same problem -- my RAID-1 array regularly has a non-zero mismatch_cnt. For a RAID-1 array, this isn't *necessarily* a problem, especially with swap or mmap'ed files on the array. See this post on the CentOS mailing list for more info:
http://lists.centos.org/pipermail/centos/2009-December/086667.html
Cheers, Raman
On 12/29/2009 12:29 PM, Michael Cronenworth wrote:
Raman Gupta on 12/29/2009 11:24 AM wrote:
The OP already said he didn't have any drive errors...
Where? He said he scanned log files but didn't look at SMART info. Unless he has smartd running, he won't see any. No need to be condescending.
My apologies... you're right -- bad assumption on my part.
Cheers, Raman
On Tuesday 29 December 2009 06:27:27 am Michael Cronenworth wrote:
Rick Wagner on 12/28/2009 09:41 PM wrote:
- Where can the errors be coming from? I would understand if a drive were reporting errors. Could it be that during boot, one of the R-1 members is being written to before MD is started?
If you have X running, you can start palimpset and view SMART data. If you don't have X running, you can run "devkit-disks --dump" to view SMART data. Look for "reallocated sectors", "uncorrectable sectors", or "pending sectors", and if they are not showing 0 for current value, then you have bad sectors forming.
Hi Michael,
Thanks for the suggestions. I did not find 'palimpset', but used your 'devkit' suggestion. Looking at the physical device entries (i.e. 'sd[abc][1234]?') did not show anything like error counts. I take that to mean there are no errors to be reported, though an affirmative entry of "no errors" would be more reassuring. These are fairly modern drives, so I would expect SMART to be supported on them, but I wonder a little on seeing the entry: "ATA SMART: not available".
Looking forward at other replies: I scanned the logs for interesting messages on sd[abc], not specifically for medium errors. I expected that I had smartd running, but your later reply inspired me to verify, and I found it was not. Thanks for the reminder on that.
Below is a sample of the "devkit-disks --dump" for one of the physical devices, with the "not available" for the "ATA SMART" entry:
Other replies also imply that this is common if you have swap or mmaped files on the MD. swap is on separate partitions, but I suppose the system or service may have some mmaped files (this is the root partition).
Thanks for the suggestions, --rick
========================================================================
Showing information for /org/freedesktop/DeviceKit/Disks/devices/sda
  native-path: /sys/devices/pci0000:00/0000:00:05.0/host0/target0:0:0/0:0:0:0/block/sda
  device: 8:0
  device-file: /dev/sda
    by-id: /dev/disk/by-id/ata-WDC_WD1001FALS-00J7B0_WD-WMATV0665717
    by-id: /dev/disk/by-id/scsi-SATA_WDC_WD1001FALS-_WD-WMATV0665717
    by-path: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:0
  detected at: Tue 29 Dec 2009 11:17:52 AM PST
  system internal: 1
  removable: 0
  has media: 1 (detected at Tue 29 Dec 2009 11:17:52 AM PST)
    detects change: 0
    detection by polling: 0
    detection inhibitable: 0
    detection inhibited: 0
  is read only: 0
  is mounted: 0
  mount paths:
  mounted by uid: 0
  presentation hide: 0
  presentation name:
  presentation icon:
  size: 1000204886016
  block size: 512
  job underway: no
  usage:
  type:
  version:
  uuid:
  label:
  partition table:
    scheme: mbr
    count: 4
  drive:
    vendor: ATA
    model: WDC WD1001FALS-0
    revision: 05.0
    serial: WD-WMATV0665717
    ejectable: 0
    require eject: 0
    media:
      compat:
    interface: ata
    if speed: (unknown)
    ATA SMART: not available
On 09-12-29 15:18:48, Rick Wagner wrote: ...
Thanks for the suggestions. I did not find 'palimpset', but used
sp. palimpsest
your 'devkit' suggestion. Looking at the physical device entries (i.e. 'sd[abc][1234]?') did not show anything like error counts. I take that to mean there are no errors to be reported, though an affirmative entry of "no errors" would be more reassuring. These are fairly modern drives, so I would expect SMART to be supported on them, but I wonder a little on seeing the entry: "ATA SMART: not available".
...
# smartctl -a /dev/sda
I expect that you've seen the reply that says what you are seeing is normal operation when memory-mapped files are written to. You might try flushing the buffers with a `sync ; sleep 2` before doing a scan.
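To narrow the smartctl output down to the three counters mentioned earlier in the thread, something like the following may help. It is only a sketch under assumptions: the function name smart_bad_sectors is invented here, the attribute names (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable) are the ones smartctl commonly prints but can differ by drive, and it checks the raw value in the tenth column of the "smartctl -A" attribute table.

```shell
#!/bin/sh
# smart_bad_sectors: read a "smartctl -A" attribute table on stdin and print
# the bad-sector counters whose raw value (10th column) is above zero.
# Attribute names are assumptions; adjust them to what your drive reports.
smart_bad_sectors() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ \
         && $10 + 0 > 0 { print $2, $10 }'
}

# Usage (as root):
#   smartctl -A /dev/sda | smart_bad_sectors
```

No output means none of the three counters was found above zero; it does not prove the drive supports or reports those attributes.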
On 12/29/2009 03:18 PM, Rick Wagner wrote:
On Tuesday 29 December 2009 06:27:27 am Michael Cronenworth wrote: Other replies also imply that this is common if you have swap or mmaped files on the MD. swap is on separate partitions, but I suppose the system or service may have some mmaped files (this is the root partition).
I believe this will list the open memory-mapped files on your root partition:
lsof -d mem -a +f -- /
Out of the resulting list you can probably eliminate libraries that are being read but not written to. If /var is on your root filesystem, then there will probably be a bunch of mmap'ed stuff in the various directories inside /var.
Cheers, Raman
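The elimination step described above can be roughed out with a small filter. This is a sketch, not a tested recipe from the thread: the function name mmap_candidates is made up, it assumes lsof's default column layout with the file name in the last field, and it will mis-handle names containing spaces.

```shell
#!/bin/sh
# mmap_candidates: read lsof output on stdin, drop the header line and
# anything that looks like a shared library (*.so, *.so.N), and print the
# remaining mapped file names once each.
mmap_candidates() {
    awk 'NR > 1 && $NF !~ /\.so(\.[0-9.]+)?$/ { print $NF }' | sort -u
}

# Usage:
#   lsof -d mem -a +f -- / | mmap_candidates
```

What remains still includes executables and other read-only mappings, so treat the list as a starting point rather than a list of files being written.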
On Tuesday 29 December 2009 10:17:32 pm Raman Gupta wrote:
On 12/29/2009 03:18 PM, Rick Wagner wrote:
On Tuesday 29 December 2009 06:27:27 am Michael Cronenworth wrote: Other replies also imply that this is common if you have swap or mmaped files on the MD. swap is on separate partitions, but I suppose the system or service may have some mmaped files (this is the root partition).
I believe this will list the open memory-mapped files on your root partition:
lsof -d mem -a +f -- /
Out of the resulting list you can probably eliminate libraries that are being read but not written to. If /var is on your root filesystem, then there will probably be a bunch of mmap'ed stuff in the various directories inside /var.
Cheers, Raman
Indeed, /var is on this volume, and lsof shows quite a number of mmaped files in /var/cache and /var/tmp. Thank you all for your help on this, I will relax now.
--rick
Eero Tamminen on 12/30/2009 04:09 AM wrote:
Indeed, /var is on this volume, and lsof shows quite a number of mmaped files in /var/cache and /var/tmp. Thank you all for your help on this, I will relax now.
Would you feel up to filing a bug against mdadm (the owner of 99-raid-check) asking for a better description in the warning? Rough example: "WARNING: mismatch_cnt not 0 on /dev/$dev, not harmful, but repaired with 'echo repair > /sys/block/md#/md/sync_action'"
On Mon, 28 Dec 2009 19:41:53 -0800 Rick Wagner rjwgnr27@verizon.net wrote:
- Where can the errors be coming from? I would understand if a drive were reporting errors. Could it be that during boot, one of the R-1 members is being written to before MD is started?
This warning means that there are blocks that are not identical between your raid drives, in free space.
- Without drive error messages, how do I determine which drive is out of sync? I have resynced SDA2 to SDB2, but that is basically a coin flip; if SDB2 were correct, then I may have damaged files on SDA2. How can I determine where the mismatches occur, and then determine the file(s) potentially affected?
No need to. Just ask it to run a repair.
echo repair >/sys/block/md<#>/md/sync_action
then another check:
echo check >/sys/block/md<#>/md/sync_action
kevin
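The repair-then-check sequence above can be wrapped so that the second step only starts once the first has finished. This is a sketch under assumptions: the helper names start_action and wait_idle are invented for illustration, and it relies on sync_action reading "idle" when no operation is running.

```shell
#!/bin/sh
# MD_DIR is the array's sysfs directory, e.g. /sys/block/md1/md.

# start_action ACTION MD_DIR: ask md to start "repair" or "check".
start_action() {
    echo "$1" > "$2/sync_action"
}

# wait_idle MD_DIR [TRIES] [DELAY]: poll sync_action until it reads "idle".
# Returns 0 once idle, 1 if still busy after TRIES polls.
wait_idle() {
    tries=${2:-3600}
    delay=${3:-10}
    while [ "$tries" -gt 0 ]; do
        [ "$(cat "$1/sync_action")" = "idle" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Usage (as root), with md1 as the example array from this thread:
#   d=/sys/block/md1/md
#   start_action repair "$d" && wait_idle "$d"
#   start_action check  "$d" && wait_idle "$d"
#   cat "$d/mismatch_cnt"
```

On a RAID-1 array a repair makes the members identical again by copying from one of them, so it clears the count without telling you which copy was right.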