Hi all,
I wonder if I could ask your advice about a very strange SCSI tape problem I'm having. I have an Overland Neo2000 tapelibrary containing dual Quantum SDLT2 drives, attached to a dual Opteron box via an Adaptec 39320 controller. The system is running FC3, up2date'd on April 6th, with kernel 2.6.10-1.770_FC3smp.
The problem appears to be that the tapes drop records when being written, but that this can be fixed by increasing the blocksize.
I have a bunch of test files filled with random numbers, e.g.:
-rw-r--r-- 1 root root 500000 Apr 26 11:03 tapetest500K
-rw-r--r-- 1 root root 1000000000 Apr 26 12:48 /var/tapetest1G
When I write these to tape and read them back, I get things like the following:
[root@ls1 ~]$ dd if=tapetest500K of=/dev/st1 bs=10240 ; dd if=/dev/st1 of=junk bs=10240 ; cmp tapetest500K junk
48+1 records in
48+1 records out
36+1 records in
36+1 records out
tapetest500K junk differ: byte 215041, line 854
[root@ls1 ~]$ dd if=tapetest500K of=/dev/st1 bs=20480 ; dd if=/dev/st1 of=junk bs=20480 ; cmp tapetest500K junk
24+1 records in
24+1 records out
24+1 records in
24+1 records out
[root@ls1 ~]$ ( cd /var ; dd if=tapetest1G of=/dev/st1 bs=10240 ; dd if=/dev/st1 of=junk bs=10240 ; cmp tapetest1G junk )
97656+1 records in
97656+1 records out
90107+1 records in
90107+1 records out
tapetest1G junk differ: byte 215041, line 866
[root@ls1 ~]$ ( cd /var ; dd if=tapetest1G of=/dev/st1 bs=20480 ; dd if=/dev/st1 of=junk bs=20480 ; cmp tapetest1G junk )
48828+1 records in
48828+1 records out
48801+1 records in
48801+1 records out
tapetest1G junk differ: byte 2621441, line 10186
[root@ls1 ~]$ (cd /var ; dd if=tapetest1G of=/dev/st1 bs=32768 ; dd if=/dev/st1 of=junk bs=32768 ; cmp tapetest1G junk)
30517+1 records in
30517+1 records out
30517+1 records in
30517+1 records out
[root@ls1 ~]$
The problem happens on both tapedrives. No errors are logged in /var/log/messages. The Neo2000 system performs faultlessly when attached to my Alpha box.
Does this look to you like a hardware problem? Or could it be a driver issue? Unfortunately I don't have enough spare kit to try things on another box.
Regards, Terry
T. Horsnell wrote: ...
The problem appears to be that the tapes drop records when being written, but that this can be fixed by increasing the blocksize.
...
Are you sure that it is during the write operation?
[root@ls1 ~]$ dd if=tapetest500K of=/dev/st1 bs=10240 ; dd if=/dev/st1 of=junk bs=10240 ; cmp tapetest500K junk
48+1 records in
48+1 records out
36+1 records in
36+1 records out
tapetest500K junk differ: byte 215041, line 854
If you add another
dd if=/dev/st1 of=junk2 bs=10240
does that equal junk or is the error in another position?
The library has two drives, right? Is it a problem on both drives?
Mogens
[root@ls1 ~]$ dd if=tapetest500K of=/dev/st1 bs=10240 ; dd if=/dev/st1 of=junk bs=10240 ; cmp tapetest500K junk
48+1 records in
48+1 records out
34+0 records in
34+0 records out
tapetest500K junk differ: byte 153601, line 593
[root@ls1 ~]$ dd if=/dev/st1 of=junk2 bs=10240; cmp junk junk2
34+0 records in
34+0 records out
[root@ls1 ~]$ dd if=tapetest50M of=/dev/st1 bs=10240 ; dd if=/dev/st1 of=junk bs=10240 ; cmp tapetest50M junk
4882+1 records in
4882+1 records out
2576+1 records in
2576+1 records out
tapetest50M junk differ: byte 153601, line 628
[root@ls1 ~]$ df -k .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda7 1004024 456760 496260 48% /
[root@ls1 ~]$ dd if=/dev/st1 of=junk2 bs=10240
2576+1 records in
2576+1 records out
[root@ls1 ~]$ diff junk junk2
Furthermore, looking at the byte position at which cmp fails, it's always on an exact record boundary. Also, I have a tape utility which can tell me the size of tape records. Using this I see that the last record (the fractional part if the filesize is not an integral number of tape-blocks) is always the correct size (I don't know whether it has the correct contents).
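The record-boundary observation can be checked with a little arithmetic. What follows is only a sketch, assuming cmp numbers bytes from 1 (as GNU cmp does); the helper name locate_mismatch is made up for illustration, and the offsets and block sizes are the ones shown in the transcripts in this thread.

```python
from typing import Tuple


def locate_mismatch(cmp_byte: int, block_size: int) -> Tuple[int, int]:
    """Map a 1-based cmp byte number to (intact_records, offset_in_next_record)."""
    zero_based = cmp_byte - 1
    return divmod(zero_based, block_size)


# (cmp byte number, dd block size) pairs taken from the transcripts in this thread.
for cmp_byte, bs in [(215041, 10240), (153601, 10240), (2621441, 20480)]:
    intact, within = locate_mismatch(cmp_byte, bs)
    print(f"byte {cmp_byte}, bs={bs}: {intact} intact records, "
          f"mismatch starts {within} bytes into the next one")
```

An offset of 0 into the next record means the first differing byte is the first byte of a record, which is what one would expect if whole records are going missing rather than data being corrupted inside a record.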
The problem still occurs if I make the filesize an integral number of tape-records.
I'll put the tapelibrary back on my Alpha box and check what I get when I read tapes which were written on the Opteron. This should clinch the theory that it's a write problem (or not, as the case may be).
The library has two drives, right? Is it a problem on both drives?
Yes. (See the end of my first post).
Cheers, Terry.
Mogens
-- Mogens Kjaer, Carlsberg A/S, Computer Department Gamle Carlsberg Vej 10, DK-2500 Valby, Denmark Phone: +45 33 27 53 25, Fax: +45 33 27 47 08 Email: mk@crc.dk Homepage: http://www.crc.dk
-- fedora-list mailing list fedora-list@redhat.com To unsubscribe: http://www.redhat.com/mailman/listinfo/fedora-list
I'm lying here (about the last record always being the correct size). All the ones I've checked so far had the correct remnant-record size, but I see from my test above that one of them doesn't ('34+0 records in' indicates no remnant).
When I put the lib back onto the Alpha, two test files (tapetest500K and tapetest50M) written on the Opteron and read back on the Alpha matched the files as read back on the Opteron. This convinces me that it's (at least) a 'write' problem.
Cheers, Terry.
T. Horsnell wrote: ...
When I put the lib back onto the Alpha, two test files (tapetest500K and tapetest50M) written on the Opteron and read back on the Alpha matched the files as read back on the Opteron. This convinces me that it's (at least) a 'write' problem.
Is it just the library you move, or can you also try to move the SCSI controller and cables?
The library has two drives, right? Is it a problem on both drives?
Yes. (See the end of my first post).
Sorry, I missed this line.
Mogens
I just move the library, but I've swapped cables as well. I also contacted the author of the st driver and he reckons I should try a different sort of SCSI adapter. I'm coming to the same conclusion, but perhaps I'll try the author of the aic79xx driver first, since I'll have to buy an adapter in order to try things.
Cheers, Terry.
I found a PC running FC3 (kernel 2.6.10-1.741_FC3) which had an Adaptec 29160 SCSI adapter in a 32-bit PCI slot. I connected my SDLT2 library to it and repeated my tests. Everything worked!
Adaptec 39320 uses aic79xx driver
Adaptec 29160 uses aic7xxx driver
I took the 29160 out of the PC, installed it in the Opteron box (into a 64-bit PCI-X slot) and repeated the tests. Failure. The Opteron has 4 kernels available: 2.6.10-1.770_FC3smp 2.6.10-1.770_FC3smp 2.6.10-1.770_FC3 2.6.10-1.770_FC3. I tried them all. Failure. I even booted from a Knoppix CD with 2.4.27. (This presumably means I'm running a 32-bit kernel on a 64-bit box.) Failure.
I remembered I had a desktop Compaq SDLT1 tapedrive on one of my systems. I tried that on the 29160 adapter. Success. I tried it on the 39320 adapter. Success. Is it some sort of datarate problem, I ask myself? (The SDLT1 is about half the speed of the SDLT2, and the SDLT2 worked on the PC.) I moved the 29160 out of the PCI-X slot into a PCI slot and tried again with the 5 kernels listed above. Failure.
I'm now running out of ideas/energy/hardware. With my original configuration (Opteron, 2.6.10-1.770_FC3smp, Adaptec 39320, dual SDLT2), I can get verified tar dumps of at least 4GB (I haven't tried anything bigger) provided I use a record size > 32768 (64*512). 65536 (128*512) works fine, as does 131072 (256*512). 32768 fails.
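The record-size threshold could be mapped more systematically with a small sweep. The following is a rough sketch rather than a tested tape tool: it assumes the st driver is in variable-block mode (so each write() call becomes one tape record and each read() returns one record), that the target is a rewinding device such as /dev/st1, and that the function names are purely illustrative. Pointed at a scratch file instead of a tape device, it serves as a dry run.

```python
import os


def write_records(path, data, block_size):
    """Write data with one write() call per block; return the number of calls."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        count = 0
        for start in range(0, len(data), block_size):
            chunk = data[start:start + block_size]
            if os.write(fd, chunk) != len(chunk):
                raise OSError("short write at byte %d" % start)
            count += 1
        return count
    finally:
        os.close(fd)


def read_records(path, block_size):
    """Read until read() returns nothing (end of file, or a filemark on tape)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        records = []
        while True:
            chunk = os.read(fd, block_size)
            if not chunk:
                return records
            records.append(chunk)
    finally:
        os.close(fd)


def sweep(path, data, sizes=(10240, 20480, 32768, 65536, 131072)):
    """Round-trip data at each block size; return (size, written, read, ok) tuples."""
    results = []
    for block_size in sizes:
        written = write_records(path, data, block_size)
        records = read_records(path, block_size)
        results.append((block_size, written, len(records),
                        b"".join(records) == data))
    return results
```

For example, sweep("/dev/st1", open("tapetest500K", "rb").read()) would repeat the dd tests from this thread at each size and show where the written and read record counts start to agree.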
Is it time to file a bug report? Who with?
Thanks for any and all suggestions,
Terry.
This is sounding a lot like a problem I ran into with a DDS-2 tape drive (Archive IBM4326NP/RP). When the drive's internal counter of bytes written to the tape (compressed data) reached 4 GiB the drive would (a) skip writing one block of data, and (b) fail to log any further errors (e.g., write retries). The byte counter would remain stuck at the 4 GiB mark. No error was ever reported by the drive. This was using an Adaptec 2940 SCSI adapter and the aic7xxx driver. I'm reasonably sure that neither the SCSI adapter nor the driver was at fault because they see only the uncompressed data stream and have no way to know when the drive's internal counter hits 4 GiB. In addition, the same adapter and driver work just fine on a DDS-3 tape drive.
I first realized there was a problem when I noticed that the drive's block counter (shown by "mt tell") sometimes incremented by one less than the number of blocks reported by 'dd'. After much testing and looking at the counters reported by "smartctl -l error" I was able to identify the problem. Though the advertised capacity of a DDS-2 tape is 4 GB compressed, the counters do not get reset when changing tapes, so it's possible to encounter the error at any point on the tape.
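The comparison between the drive's block counter and dd's record counts can be written down as a small check. This is only a sketch: the function names are invented for illustration, the two positions are assumed to have been read by hand from "mt tell" on the non-rewinding device before and after the write, and whether filemarks count toward the position depends on the drive, so the result is indicative at best.

```python
import re


def dd_record_count(dd_line):
    """Turn a dd status line such as '48+1 records out' into a total record count."""
    match = re.fullmatch(r"\s*(\d+)\+(\d+) records (?:in|out)\s*", dd_line)
    if match is None:
        raise ValueError("not a dd record line: %r" % dd_line)
    full, partial = map(int, match.groups())
    return full + partial


def blocks_unaccounted(tell_before, tell_after, dd_line):
    """Records dd reports minus the advance in the drive's block position."""
    return dd_record_count(dd_line) - (tell_after - tell_before)
```

A result of 0 means the drive's position advanced by exactly the number of records dd wrote; a positive result is the number of records the drive did not count.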
Thanks for this info, Bob. I don't think this applies in my case, as my problems occur when writing files if the recordsize is small-ish (e.g. 10240 bytes) and even happen on files 500Kbytes long, but I can successfully write files, even large files (tar of 4.7GB), if I use a record size of 65536 and upward. The symptom is that, as in your case, records are lost with no errors logged.
e.g.
[root@ls1 ~]$ dd if=tapetest500K of=/dev/st1 bs=10240 ; dd if=/dev/st1 of=junk bs=10240 ; cmp tapetest500K junk
48+1 records in
48+1 records out
36+1 records in
36+1 records out
tapetest500K junk differ: byte 215041, line 854
tapetest500K is full of random numbers, and in this and all other tests the mismatches always start on a record-boundary, which makes me think that whole records are being dropped. If I read the tapes on my Alpha box (the SDLT2 drives work OK in this box) I get the same read mismatches, so I assume it's a write problem. I guess I should write something to try and find out which records they are.
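One way to find out which records go missing is to stamp every record with its own sequence number before writing. The sketch below is an illustration under stated assumptions, not a finished tool: the 8-byte header layout and the function names are invented here, the filler is random so the data stays incompressible like the original test files, and the read-back is assumed to consist only of whole records (which is what the record-boundary mismatches suggest).

```python
import os
import struct

# 8-byte big-endian sequence number at the start of each record (made-up layout).
HEADER = struct.Struct(">Q")


def make_stamped_data(n_records, block_size):
    """Build n_records blocks; each begins with its index, the rest is random."""
    return b"".join(
        HEADER.pack(i) + os.urandom(block_size - HEADER.size)
        for i in range(n_records)
    )


def missing_records(readback, block_size, n_records):
    """Return the sequence numbers that never appear in the read-back data."""
    seen = set()
    for start in range(0, len(readback) - HEADER.size + 1, block_size):
        (seq,) = HEADER.unpack_from(readback, start)
        seen.add(seq)
    return [i for i in range(n_records) if i not in seen]
```

The idea would be to save make_stamped_data(49, 10240) to a file, dd it to tape and back with bs=10240 as in the tests above, and then call missing_records on the contents of junk to list the dropped record numbers.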
My current suspicion is that it's something to do with the use of the buffer on the tapedrive itself, and the SCSI driver. Some sort of timing issue maybe. I'm about to contact Quantum to try and get some more info about this.
Cheers, Terry.
-- Bob Nichols rnichols42@comcast.net