How to find a needle in a haystack?
aragonx at dcsnow.com
Wed May 19 18:28:46 UTC 2010
>> I have a backup server that contains 10 ext3 file systems each with 12
>> million files scattered randomly over 4000 directories. The files
>> average
>
> OK, 4000 directories for 12 million files means you've got 3000 files per
> directory. If you are doing that, make sure your fs has htree enabled. The
> initial find is still going to suck, especially on spinning media, and
> doubly so (actually far more than doubly) if the disk is also busy trying
> to do other work.
dir_index is enabled, as is noatime. I'm not sure, but I believe
noatime was supposed to speed things up a little as well.
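For anyone following along: you can confirm dir_index (htree) the same way on any ext2/3 filesystem with tune2fs. The sketch below demos it on a scratch image file rather than a real device, so the paths here are made up and not from this thread:

```shell
#!/bin/sh
# Sketch: confirm the dir_index (htree) feature is on.
# We build a throwaway ext2 image in a file so no real disk is touched.
set -e
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=16 status=none
mke2fs -F -q -O dir_index "$IMG"       # fresh fs with dir_index forced on
tune2fs -l "$IMG" | grep -o dir_index  # prints: dir_index
# On an existing fs you'd enable it with: tune2fs -O dir_index /dev/sdXN
# then run e2fsck -fD to rebuild the existing directories as htrees.
```

Note that enabling dir_index after the fact only htree-indexes directories once e2fsck -fD has rewritten them; it does nothing for directories created before the rebuild.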
>> size is 1MB. Every day I expect to get 20 or so requests for files from
>> this archive. The files were not stored in any logical structure that I
>> can use to narrow down the search. This will be different moving
>> forward
>> but it does not help me for the old data. Additionally, every day data
>> is
>> added and old data is removed to make space.
>
> What are you searching by - name or content ?
>
> If you are searching by content then you really want to stuff the lot
> into a freetext engine like Omega. If you are searching by file name I'd
> take the time to turn the archive upside down and archive that way up
> assuming your tools can do it.
>
> That is turn all the files backwards so
>
> foo/bar/hello.c
>
> becomes
>
> hello.c/bar/foo
>
> or build a symlink farm of them that way up (so hello.c/bar/foo is a
> symlink to foo/bar/hello.c)
>
> It's then suddenly a lot less painful to find things! A database
> would also no doubt do the job
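The path reversal described above can be sketched in a few lines of shell. This is only a sketch of the idea, assuming GNU coreutils (tac, paste); the /archive and farm paths are placeholders, demoed here on throwaway temp directories:

```shell
#!/bin/sh
# Sketch: build a "reversed" symlink farm so that
#   hello.c/bar/foo  ->  foo/bar/hello.c
# A filename lookup then becomes one readdir instead of a full find.
set -e
ARCHIVE=$(mktemp -d)   # stand-in for the real archive root
FARM=$(mktemp -d)      # stand-in for the reversed lookup tree

# demo file: $ARCHIVE/foo/bar/hello.c
mkdir -p "$ARCHIVE/foo/bar"
echo 'int main(void){return 0;}' > "$ARCHIVE/foo/bar/hello.c"

find "$ARCHIVE" -type f | while IFS= read -r path; do
    rel=${path#"$ARCHIVE"/}                         # foo/bar/hello.c
    # reverse the components: foo/bar/hello.c -> hello.c/bar/foo
    rev=$(printf '%s\n' "$rel" | tr '/' '\n' | tac | paste -sd/ -)
    mkdir -p "$FARM/$(dirname -- "$rev")"
    ln -s "$path" "$FARM/$rev"
done

ls "$FARM/hello.c"                  # prints: bar
readlink "$FARM/hello.c/bar/foo"    # the original path under $ARCHIVE
```

With the farm in place, "where are all the files named hello.c?" is answered by listing one directory, at the cost of one symlink (one inode) per archived file.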
I am searching by name only. I am a little unclear on this, though: the
idea is to create a symlink for each file in the root of the filesystem?
Won't that hurt something?
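If the worry is cluttering the filesystem with 12 million symlinks, the "database" route can be as simple as a locate(1)-style flat index: dump every path once a night and grep it per request. A sketch under assumed placeholder paths, demoed on temp files:

```shell
#!/bin/sh
# Sketch: a poor man's locate. One sequential scan of a flat path list
# replaces millions of random directory lookups at search time.
set -e
ARCHIVE=$(mktemp -d)   # stand-in for the real archive root
INDEX=$(mktemp)        # stand-in for e.g. /var/cache/archive.index

mkdir -p "$ARCHIVE/foo/bar"
: > "$ARCHIVE/foo/bar/hello.c"

# nightly (cron) step: rebuild the index after the day's adds/removes
find "$ARCHIVE" -type f > "$INDEX"

# per-request step: every path ending in the requested filename
grep '/hello\.c$' "$INDEX"
```

At ~20 lookups a day, grepping a 12-million-line text file is cheap, and a stale entry just means re-checking that the file still exists before serving it.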
---
Will Y.