How to find a needle in a haystack?
aragonx at dcsnow.com
Wed May 19 18:28:46 UTC 2010
>> I have a backup server that contains 10 ext3 file systems each with 12
>> million files scattered randomly over 4000 directories. The files
>> average
>
> OK, 4000 directories for 12 million files means you've got 3000 files per
> directory. If you are doing that, make sure your fs has htree enabled. The
> initial find is still going to suck, especially on spinning media, and
> doubly so (actually far more than doubly) if the disk is also busy trying
> to do other work.
dir_index is enabled, as is noatime. I'm not sure, but I believe
noatime was supposed to speed things up a little as well.
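For anyone following along: you can confirm dir_index (htree) the same way on any ext2/3 filesystem with tune2fs. The sketch below demos it on a scratch image file rather than a real device, so the paths here are made up and not from this thread:

```shell
#!/bin/sh
# Sketch: confirm the dir_index (htree) feature is on.
# We build a throwaway ext2 image in a file so no real disk is touched.
set -e
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=16 status=none
mke2fs -F -q -O dir_index "$IMG"       # fresh fs with dir_index forced on
tune2fs -l "$IMG" | grep -o dir_index  # prints: dir_index
# On an existing fs you'd enable it with: tune2fs -O dir_index /dev/sdXN
# then run e2fsck -fD to rebuild the existing directories as htrees.
```

Note that enabling dir_index after the fact only htree-indexes directories once e2fsck -fD has rewritten them; it does nothing for directories created before the rebuild.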
>> size is 1MB. Every day I expect to get 20 or so requests for files from
>> this archive. The files were not stored in any logical structure that I
>> can use to narrow down the search. This will be different moving
>> forward
>> but it does not help me for the old data. Additionally, every day data
>> is
>> added and old data is removed to make space.
>
> What are you searching by - name or content ?
>
> If you are searching by content then you really want to stuff the lot
> into a freetext engine like Omega. If you are searching by file name I'd
> take the time to turn the archive upside down and archive that way up
> assuming your tools can do it.
>
> That is turn all the files backwards so
>
> foo/bar/hello.c
>
> becomes
>
> hello.c/bar/foo
>
> or build a symlink farm of them that way up (so hello.c/bar/foo is a
> symlink to foo/bar/hello.c)
>
> It's then suddenly a lot less painful to find things! A database
> would also no doubt do the job
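The path reversal described above can be sketched in a few lines of shell. This is only a sketch of the idea, assuming GNU coreutils (tac, paste); the /archive and farm paths are placeholders, demoed here on throwaway temp directories:

```shell
#!/bin/sh
# Sketch: build a "reversed" symlink farm so that
#   hello.c/bar/foo  ->  foo/bar/hello.c
# A filename lookup then becomes one readdir instead of a full find.
set -e
ARCHIVE=$(mktemp -d)   # stand-in for the real archive root
FARM=$(mktemp -d)      # stand-in for the reversed lookup tree

# demo file: $ARCHIVE/foo/bar/hello.c
mkdir -p "$ARCHIVE/foo/bar"
echo 'int main(void){return 0;}' > "$ARCHIVE/foo/bar/hello.c"

find "$ARCHIVE" -type f | while IFS= read -r path; do
    rel=${path#"$ARCHIVE"/}                         # foo/bar/hello.c
    # reverse the components: foo/bar/hello.c -> hello.c/bar/foo
    rev=$(printf '%s\n' "$rel" | tr '/' '\n' | tac | paste -sd/ -)
    mkdir -p "$FARM/$(dirname -- "$rev")"
    ln -s "$path" "$FARM/$rev"
done

ls "$FARM/hello.c"                  # prints: bar
readlink "$FARM/hello.c/bar/foo"    # the original path under $ARCHIVE
```

With the farm in place, "where are all the files named hello.c?" is answered by listing one directory, at the cost of one symlink (one inode) per archived file.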
I am searching by name only. I am a little unclear on this, though: the
idea is to create a symlink for each file in the root of the filesystem?
Won't that hurt something?
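If the worry is cluttering the filesystem with 12 million symlinks, the "database" route can be as simple as a locate(1)-style flat index: dump every path once a night and grep it per request. A sketch under assumed placeholder paths, demoed on temp files:

```shell
#!/bin/sh
# Sketch: a poor man's locate. One sequential scan of a flat path list
# replaces millions of random directory lookups at search time.
set -e
ARCHIVE=$(mktemp -d)   # stand-in for the real archive root
INDEX=$(mktemp)        # stand-in for e.g. /var/cache/archive.index

mkdir -p "$ARCHIVE/foo/bar"
: > "$ARCHIVE/foo/bar/hello.c"

# nightly (cron) step: rebuild the index after the day's adds/removes
find "$ARCHIVE" -type f > "$INDEX"

# per-request step: every path ending in the requested filename
grep '/hello\.c$' "$INDEX"
```

At ~20 lookups a day, grepping a 12-million-line text file is cheap, and a stale entry just means re-checking that the file still exists before serving it.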
---
Will Y.