How to find a needle in a haystack?

Tue May 18 21:44:38 UTC 2010

> I have a backup server that contains 10 ext3 file systems each with 12
> million files scattered randomly over 4000 directories.  The files average

OK, 4000 directories for 12 million files means you've got about 3000
files per directory. If you are doing that, make sure your fs has htree
(the dir_index feature) enabled. The initial find is still going to suck,
especially on spinning media, and doubly so (actually far more than
doubly) if the disk is also busy trying to do other work.

> size is 1MB.  Every day I expect to get 20 or so requests for files from
> this archive.  The files were not stored in any logical structure that I
> can use to narrow down the search.  This will be different moving forward
> but it does not help me for the old data.  Additionally, every day data is
> added and old data is removed to make space.

What are you searching by: name or content?

If you are searching by content then you really want to stuff the lot
into a free-text engine like Omega (from Xapian). If you are searching by
file name I'd take the time to turn the archive upside down and store it
that way up, assuming your tools can do it.

That is, reverse every path so that

	foo/bar/hello.c

becomes

	hello.c/bar/foo

or build a symlink farm of them that way up (so hello.c/bar/foo is a
symlink to foo/bar/hello.c).
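
A rough sketch of the symlink farm idea in Python, in case it helps. The
function names and the archive_root/farm_root arguments are made up for
illustration, and it assumes farm_root lives outside the archive. Paths
that collide (a/x next to b/a/x, where x/a would have to be both a
symlink and a directory) are just collected and returned, not resolved.

```python
import os

def reversed_path(rel_path):
    """Turn foo/bar/hello.c into hello.c/bar/foo."""
    return os.sep.join(reversed(rel_path.split(os.sep)))

def build_symlink_farm(archive_root, farm_root):
    """Create one symlink per archived file under farm_root, path reversed.

    Returns the relative paths that could not be linked (collisions,
    permission problems), so the caller can decide what to do with them.
    """
    archive_root = os.path.abspath(archive_root)
    skipped = []
    for dirpath, _dirnames, filenames in os.walk(archive_root):
        for name in filenames:
            real = os.path.join(dirpath, name)
            rel = os.path.relpath(real, archive_root)
            link = os.path.join(farm_root, reversed_path(rel))
            try:
                os.makedirs(os.path.dirname(link), exist_ok=True)
                os.symlink(real, link)
            except OSError:
                skipped.append(rel)
    return skipped

def locate(farm_root, name):
    """Return the real paths of every archived file called `name`."""
    hits = []
    # Only the single directory farm_root/<name> is walked, not the archive.
    for dirpath, _dirnames, filenames in os.walk(os.path.join(farm_root, name)):
        for entry in filenames:
            hits.append(os.path.realpath(os.path.join(dirpath, entry)))
    return sorted(hits)
```

A lookup is then locate(farm_root, "hello.c"), which only touches the
hello.c directory in the farm. Since data is added and removed every day,
the farm would need the same daily add/remove treatment as the archive.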

It's then suddenly a lot less painful to find things: every copy of
hello.c sits under the single directory hello.c/. A database would no
doubt also do the job.
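
For the database route, something like the following might do as a
starting point. It is only a sketch using SQLite from Python's standard
library; the table layout and the function names are my own invention,
not anything the backup tools provide.

```python
import os
import sqlite3

def build_index(archive_root, db_path):
    """Walk the archive once and record (file name, full path) pairs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "name TEXT NOT NULL, path TEXT NOT NULL UNIQUE)"
    )
    # The index on name is what makes lookups cheap afterwards.
    conn.execute("CREATE INDEX IF NOT EXISTS files_by_name ON files (name)")
    with conn:  # one transaction for the whole walk, committed on exit
        for dirpath, _dirnames, filenames in os.walk(archive_root):
            conn.executemany(
                "INSERT OR IGNORE INTO files (name, path) VALUES (?, ?)",
                ((n, os.path.join(dirpath, n)) for n in filenames),
            )
    return conn

def find_by_name(conn, name):
    """Return every archived path whose file name equals `name`."""
    rows = conn.execute(
        "SELECT path FROM files WHERE name = ? ORDER BY path", (name,)
    )
    return [row[0] for row in rows]
```

The slow walk happens once; after that each of the 20 or so daily
requests is an indexed lookup. Rows would have to be inserted and deleted
as files are added to and expired from the archive, otherwise the index
goes stale.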