How to find a needle in a haystack?

Wed May 19 18:07:14 UTC 2010

> On Tue, 2010-05-18 at 16:49 -0400, aragonx at dcsnow.com wrote:
>> Hello all,
>>
>> I need some ideas.
>>
>> I have a backup server that contains 10 ext3 file systems each with 12
>> million files scattered randomly over 4000 directories.  The files
>> average size is 1MB.
>
> So each filesystem is about 12*10^6 * 1MB = 12*10^12 or 12 terabytes?

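Each filesystem is 2.5TB, so the average file size must be much smaller
than 1MB: at most roughly 200KB if 2.5TB is spread over 12 million
files. At last count, one of the filesystems contained 20 million files,
which would put its average at no more than about 125KB.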

> You don't say what the file contents are like, e.g. text, structured
> data, unstructured binary, etc, nor do you say how you match the file
> you want (e.g. is it equivalent to a text substring, a regular
> expression, or what?). Knowing what the contents look like would help to
> evaluate if it's worth e.g. generating a hash for subsections of the
> file when it's being stored. Alternatively, it could conceivably make
> sense to search for strings in the raw disk and work backwards to
> calculate what files they belong to, who knows?

The data in the files is unstructured binary.  When I do a search, I
have _most_ of the file name, enough to uniquely identify the file.
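
Since the lookup key is a fragment of the file name rather than the
contents, one option might be a flat file-name index in the spirit of
locate/updatedb: walk each filesystem once, write every path to a text
file, and match fragments against that file instead of the live
directory tree. A rough Python sketch follows; build_index,
find_by_name_fragment, and the one-path-per-line index layout are
hypothetical names and choices made for illustration, not anything
already in place on the backup server.

```python
import os


def build_index(root, index_path):
    """Walk root once and write one full path per line to index_path.

    Assumption: file names contain no newline characters, since the
    index uses one line per path. Returns the number of files indexed.
    """
    count = 0
    with open(index_path, "w", encoding="utf-8",
              errors="surrogateescape") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                out.write(os.path.join(dirpath, name) + "\n")
                count += 1
    return count


def find_by_name_fragment(index_path, fragment):
    """Return every indexed path whose file name contains fragment."""
    matches = []
    with open(index_path, "r", encoding="utf-8",
              errors="surrogateescape") as idx:
        for line in idx:
            path = line.rstrip("\n")
            # Match on the file name only, not the directory part.
            if fragment in os.path.basename(path):
                matches.append(path)
    return matches
```

The index would have to be rebuilt or appended to as backups arrive,
and a linear scan of 20 million lines is still a scan, but it reads one
sequential file rather than stat-ing 4000 large directories.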

I hope that helps.

---
Will Y.
