dd question

Cameron Simpson cs at zip.com.au
Wed Dec 15 01:06:48 UTC 2010


On 10Dec2010 14:28, stan <gryt2 at q.com> wrote:
| On Fri, 10 Dec 2010 03:11:25 +0000 (UTC)
| "Amadeus W.M." <amadeus84 at verizon.net> wrote:
| > I have a binary file with data. Each block of 48 bytes is a record. I 
| > want to extract the first 8 bytes within each record. I'm thinking
| > this should be possible with dd, but gawk, perl - anything goes. It
| > just has to be fast, because the data files are ~ 1Gb.
| > 
| > I can do this in C++ but I was just wondering if it can be done with 
| > existing well tested tools.
| 
| The binary aspect makes it tricky.  If they were EOL delimited records,
| lots of tools could do this.
| 
| Here's a python function, not checked though.  It does require that you
| have enough memory to slurp the file into memory.  Put it in a file,
| edit for the filenames, and run it as python <filename>.  I guess it
| should take less than a minute, but not sure, should be fine for one
| off.
| 
| def extract (filename1 = None, filename2 = None):
|   if filename1 != None and filename2 != None:

I'd not bother with this check - it is a special purpose function that
will not be misused, and if it _is_ misused it will fail silently, which
is not good.

|     infile = open (filename1, "rb")
|     slurp = infile.read ()  # at least as much memory as the file size
|     infile.close ()
|     outfile = open (filename2, "wb")
|     while len (slurp) > 0:
|       record = slurp [:48]  # extract a record
|       first8 = record [:8]  # slice off first 8 positions
|       outfile.write (first8)  # write them out, no separator
|       slurp = slurp [48:]  # chop them off the file

This step is Very Expensive: each slice copies the whole remainder of
the data, so the loop does quadratic work. Don't reallocate a 1GB string
every 48 bytes, just pull out the pieces you need.

|     outfile.close ()
|     
| extract (filename1 = "your input filename with path", 
|          filename2 = "your output filename with path")
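
If you do want to keep the slurp-it-all approach, here is a minimal
sketch of "pull out the pieces you need": walk an offset through the
data instead of chopping the front off it. The function and parameter
names (first8_of_each_record, record_size, keep) are made up for
illustration, and it still assumes the whole file fits in memory.

```python
def first8_of_each_record(data, record_size=48, keep=8):
    # Hypothetical helper: `data` is the whole file as a byte string.
    # Step an offset through the buffer; each slice copies only `keep`
    # bytes, so the big buffer itself is never reallocated.
    pieces = []
    for offset in range(0, len(data), record_size):
        pieces.append(data[offset:offset + keep])
    # Join once at the end rather than growing a string piece by piece.
    return b"".join(pieces)
```

A trailing partial record is handled the same way as in the quoted
loop: whatever is there, up to 8 bytes, gets kept.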

Untested example:

  import sys

  def get8of48(fp):
    # Yield the first 8 bytes of each 48-byte record read from fp.
    while True:
      chunk = fp.read(48)
      if len(chunk) == 0:
        break
      yield chunk[:8]
      if len(chunk) != 48:
        sys.stderr.write("warning: short read from %s (%d bytes)\n" % (fp, len(chunk)))

  for chunk8 in get8of48(open("your filename here", "rb")):
    ... do something with chunk8, the 8-byte chunk ...

Shorter and faster, and it only ever holds one 48-byte record in memory
at a time.
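
For the original task (write the 8-byte pieces to a second file) the
same read loop can be sketched end to end as below. This is only a
sketch under assumptions: copy_first8 and its parameters are names I
have invented here, and the file names in the usage comment are
placeholders.

```python
import sys

def copy_first8(infile, outfile, record_size=48, keep=8):
    # Hypothetical helper: copy the first `keep` bytes of every
    # `record_size`-byte record from infile to outfile.
    # Both must be binary file objects. Returns the record count.
    records = 0
    while True:
        chunk = infile.read(record_size)
        if not chunk:
            break  # end of file
        outfile.write(chunk[:keep])
        records += 1
        if len(chunk) != record_size:
            sys.stderr.write(
                "warning: short final record (%d bytes)\n" % len(chunk))
    return records

# Usage, with placeholder file names:
#   with open("input.bin", "rb") as src, open("output.bin", "wb") as dst:
#       copy_first8(src, dst)
```

The file objects do their own buffering, so reading 48 bytes at a time
does not mean one system call per record.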

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

The general consensus on covered [litter] boxes is that they are a Good
Thing, and having bought one myself now and tried it out for a couple of
weeks, I agree. No more litter sprayed halfway across the city.
        - krw at prg.ox.ac.uk (Kenneth Wood)

