size discrepancy after tarring a dir

Cameron Simpson cs at zip.com.au
Fri Sep 3 23:40:51 UTC 2010


On 03Sep2010 14:21, JD <jd1008 at gmail.com> wrote:
|   I have two mounted disks,  both ext3 mounted
| as
| /sdb1
| /sdc1
| 
| On /sdb1 I have a directory, let's call it dirx.
| 
| 1. rm -rf /sdc1/dirx
| 
| 2. cd /sdb1
| 3. tar cf - dirx | tar -C /sdc1 -xpf -
| 
| Neither dir (/sdb1 and /sdc1) is accessed by any programs other
| than the tar program (and of course /sdb1 is the shell's CWD).
| The shell's history file is in my home dir.
| 
| After tar:
| 
| 4. du -sk dirx  /sdc1/dirx
| 2904536    /sdc1/dirx
| 2802124    dirx
| 
| So, why this size inflation by 104MiB ?
| 
| I repeated the process twice. Same difference.
| 
| Other dirs tarred in this way from sdb1 to sdc1 do not show this 
| discrepancy.

There are three possible sources of discrepancies that I can think of:
  - different filesystem types
  - different directory packing
  - file fragmentation

I presume we can discount the first one.

Directory packing normally is _better_ in a new directory; older directories
can accumulate holes from file deletions. So the second one seems unlikely too.
The way to check is to walk the trees with find and tally sizes with
awk:

  find /sdb1/dirx -type d -ls | awk '{sum += $7} END { print sum }'
  find /sdc1/dirx -type d -ls | awk '{sum += $7} END { print sum }'

The size difference seems too large for this anyway.

That leaves file fragmentation. Does sdc1 have a lot of other data?
Maybe complete MP3s won't fit into the gaps, and must be broken up more.
Again, like new directories, there is normally less fragmentation in
copied files, not more. And MP3s tend to be written in one go anyway, so
the source files are probably not fragmented either.

None of these choices seem likely to me.

There is a final option which should not apply because these are different
filesystems and also because your files are definitely copies: hard link
counting. du notices hard links and correctly counts a file with several
names only once. If you do this:

  du -sk dir1 dir2

and dir1 and dir2 have some files hard linked between them then du will
not count the hardlinked files again when it encounters them in dir2, and
you would then see "dir2" have a lower count than you might expect otherwise.

The way to check this one is to run two dus:

  du -sk dir1
  du -sk dir2

You can also scour your tree for hard links:

  find /sdc1/dirx -type f -links +1 -ls

though tar should preserve the hard links in your copy, and thus not
change the totals.
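
The hard link effect on du described above can be reproduced in a scratch
directory. This is only a minimal sketch: the directory and file names are
made up for the demonstration, and it assumes a du that tracks hard links
within a single invocation (GNU du does).

```shell
#!/bin/sh
# Demonstration only: all names below are hypothetical.
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"

# Create a 1 MiB file of random data in dir1 and hard link it into dir2.
dd if=/dev/urandom of="$tmp/dir1/big" bs=1024 count=1024 2>/dev/null
ln "$tmp/dir1/big" "$tmp/dir2/big"

# One du invocation: the shared file is charged to dir1 only,
# so the second output line (dir2) shows a small figure.
together=$(du -sk "$tmp/dir1" "$tmp/dir2" | awk 'NR == 2 { print $1 }')

# A separate du invocation charges dir2 for the whole file.
alone=$(du -sk "$tmp/dir2" | awk '{ print $1 }')

echo "dir2 counted together with dir1: ${together}K, on its own: ${alone}K"
rm -rf "$tmp"
```

If the two figures for dir2 differ, du is skipping files it has already
seen under another name, which is the effect to rule out here.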

In short, several things are listed above that can produce different "on
disc" sizes for copied data, and I don't really think any of them
explain your results. But do some of the checks I suggest - if nothing
else they may reveal more clues.
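
One further way to look for clues is to compare the allocated size of each
file in the two trees and see which files account for the difference. The
following is a rough sketch, not a tested recipe for this system:
usage_list is a hypothetical helper name, the default paths are taken from
the question, and file names containing newlines will confuse it.

```shell
#!/bin/sh
# Hypothetical helper: print "<KiB><TAB><relative path>" for every
# regular file under the tree given as $1, sorted by path.
usage_list() {
  ( cd "$1" && find . -type f -exec du -k {} + ) |
    sort -t "$(printf '\t')" -k 2
}

# Assumed locations from the question; override with two arguments.
src=${1:-/sdb1/dirx}
dst=${2:-/sdc1/dirx}

a=$(mktemp)
b=$(mktemp)
usage_list "$src" > "$a"
usage_list "$dst" > "$b"

# Lines marked < or > are files whose allocated size differs
# between the source tree and the copy.
diff "$a" "$b"
rm -f "$a" "$b"
```

If diff prints nothing, the per-file allocations match and the difference
must come from the directories themselves.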

| Dirx contains mp3's.

"MP3s", please. There are no apostrophes in plurals!

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

I bested him in an Open Season of scouring-people's-postings-looking-for-
spelling-errors.        - kevin at rotag.mi.org (Kevin Darcy)

