Hi.
Got a bit of a question/issue that I'm trying to resolve. I'm asking this of a few groups so bear with me.
I'm considering a situation where I have multiple processes running, and each process is going to access a number of files in a dir. Each process accesses a unique group of files, and then writes the group of files to another dir. I can easily handle this by using a form of locking, where I have the processes lock/read a file and only access the group of files in the dir based on the open/free status of the lockfile.
However, the issue with the approach is that it's somewhat synchronous. I'm looking for something that might be more asynchronous/parallel, in that I'd like to have multiple processes each access a unique group of files from the given dir as fast as possible.
So... any thoughts/pointers/comments would be greatly appreciated. Any pointers to academic research, etc., would be useful.
thanks
On Sat, Feb 28, 2009 at 21:47:39 -0800, bruce bedouglas@earthlink.net wrote:
However, the issue with the approach is that it's somewhat synchronous. I'm looking for something that might be more asynchronous/parallel, in that I'd like to have multiple processes each access a unique group of files from the given dir as fast as possible.
If each process is accessing a unique group of files why do you need locking? Can you explain a bit more about what is going on?
Bruno Wolff III wrote:
On Sat, Feb 28, 2009 at 21:47:39 -0800, bruce bedouglas@earthlink.net wrote:
However, the issue with the approach is that it's somewhat synchronous. I'm looking for something that might be more asynchronous/parallel, in that I'd like to have multiple processes each access a unique group of files from the given dir as fast as possible.
If each process is accessing a unique group of files why do you need locking? Can you explain a bit more about what is going on?
Yes, it seems like you're being deliberately vague. It's understandable if you can't disclose what you're working on for some reason. But in that case, don't expect useful help.
Matt Flaschen
On Sat, 28 Feb 2009 21:47:39 -0800 bruce wrote:
However, the issue with the approach is that it's somewhat synchronous. I'm looking for something that might be more asynchronous/parallel, in that I'd like to have multiple processes each access a unique group of files from the given dir as fast as possible.
Then just do that. You don't need to do anything special to have process A access files B and C, and process D access files E and F. Coordination is irrelevant for this task. Why would you think otherwise?
If you need to use the same file in two or more processes, each of which will take a while to run, then make a new copy of the file for each process, provided the copying takes less time than the task you intend to run on the file. That way each process still works on its own separate file.
Depending on file sizes and the nature of the job, it may be useful to create a ramdisk to hold the temporary datafile(s), if any. Or just write the task to load the data into memory on startup and work with it from there.
hi bruno.
for my situation. i have a bunch of files being created by an upfront process, and on the backend, i have a number of client/child processes that get created, which have to operate/process the files. no file is processed by more than a single client app.
i'm considering a web kind of app, where the webservice returns a group of files back to the requesting query. the webservice would have to quickly generate the list of files from the local dir, which is why i'm trying to figure out a "fast/efficient" approach for this scenario.
using any kind of file locking process requires that i essentially have a gatekeeper, allowing only a single process at a time to enter and access the files...
i can easily set up a file read/write lock process where a client app gets/locks a file, and then copies/moves the required files from the initial dir to a tmp dir. after the move/copy, the lock is released, and the client can go ahead and do whatever with the files in the tmp dir.. the process allows multiple clients to operate in a pseudo parallel manner...
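A rough Python sketch of that gatekeeper/lockfile-plus-move flow (the paths and batch size here are placeholders, not anything from the actual setup):

    import fcntl
    import os
    import shutil

    INCOMING = "/var/spool/myapp/incoming"       # hypothetical source dir
    LOCKFILE = "/var/spool/myapp/incoming.lock"  # hypothetical gatekeeper lock

    def claim_batch(workdir, batch_size=10):
        """Take the gatekeeper lock, move up to batch_size files into workdir."""
        os.makedirs(workdir, exist_ok=True)
        claimed = []
        with open(LOCKFILE, "w") as lf:
            fcntl.flock(lf, fcntl.LOCK_EX)       # only one client inside at a time
            try:
                for name in sorted(os.listdir(INCOMING))[:batch_size]:
                    shutil.move(os.path.join(INCOMING, name),
                                os.path.join(workdir, name))
                    claimed.append(name)
            finally:
                fcntl.flock(lf, fcntl.LOCK_UN)   # release as soon as the moves are done
        return claimed  # now process the files in workdir without holding the lock

The lock is held only for the listdir/move step, so clients still run in parallel once their batch is claimed; only the claiming itself is serialized.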
i'm trying to figure out if there's a much better/faster approach that might be available.. which is where the academic/research issue was raised..
the issue that i'm looking at is analogous to a FIFO, where i have lots of files being shoved in a dir from different processes.. on the other end, i want to allow multiple client processes to access unique groups of these files as fast as possible.. access being fetch/gather/process/delete the files. each file is only handled by a single client process.
thanks..
On Sun, 01 Mar 2009 09:54:59 -0800 bruce wrote:
i'm considering a web kind of app, where the webservice returns a group of files back to the requesting query. the webservice would have to quickly generate the list of files from the local dir, which is why i'm trying to figure out a "fast/efficient" approach for this scenario.
If I understand what you're trying to do, you need to use some sort of an "I'm done" flag. Have each process set a flag when it's done, and when all of the flags from each process appear, then gather the files and do whatever you're doing.
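One way to spell out that "I'm done" flag, just as a sketch (the ".done" suffix is an arbitrary choice):

    import os

    def mark_done(path):
        # drop a zero-length marker next to the data file when a process finishes
        open(path + ".done", "w").close()

    def all_done(paths):
        # the gathering step proceeds only once every marker is present
        return all(os.path.exists(p + ".done") for p in paths)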
On Sun, Mar 01, 2009 at 09:54:59AM -0800, bruce wrote:
hi bruno.
for my situation. i have a bunch of files being created by an upfront process, and on the backend, i have a number of client/child processes that get created, which have to operate/process the files. no file is processed by more than a single client app.
....
i can easily setup a file read/write lock process where a client app gets/locks a file, and then copies/moves the required files from the initial dir to a tmp dir. after the move/copy, the lock is released, and the client can go ahead and do whatever with the files in the tmp dir.. the process allows multiple clients to operate in a pseudo parallel manner...
i'm trying to figure out if there's a much better/faster approach that might be available.. which is where the academic/research issue was raised..
the issue that i'm looking at is analogous to a FIFO, where i have lots of files being shoved in a dir from different processes.. on the other end, i want to allow multiple client processes to access unique groups of these files as fast as possible.. access being fetch/gather/process/delete the files. each file is only handled by a single client process.
You should benchmark some strategies.
Standard flock(), lockf(), stat(), fcntl() and friends should do the trick. You do not want to do a copy. Locking over NFS is 'interesting', so look for a 'lock-free' strategy if you expect this to run on NFS file systems now or in the future.
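For instance, a worker can claim individual files with a non-blocking flock() and simply skip anything another worker already holds. A sketch only; this is advisory locking and, as noted, not to be trusted over NFS:

    import fcntl
    import os

    def try_claim(path):
        """Return an open fd if we won the lock on this file, else None."""
        fd = os.open(path, os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd                  # we own this file; process it, then close/unlink
        except BlockingIOError:
            os.close(fd)               # somebody else got there first; move on
            return None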
My simple minded solution is to have the creation process or a dispatcher process move files from the front end input directory into a modest set of directories for the back end processing to pick from. The number of back end processes and dirs can be tuned to match the number of processors and IO subsystem performance.
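A sketch of such a dispatcher, assuming the input dir and the per-worker dirs already exist on the same filesystem:

    import os

    def dispatch(incoming, worker_dirs):
        """Round-robin the files in incoming/ into per-worker dirs using rename()."""
        for i, name in enumerate(sorted(os.listdir(incoming))):
            dst_dir = worker_dirs[i % len(worker_dirs)]
            os.rename(os.path.join(incoming, name),
                      os.path.join(dst_dir, name))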
Renaming a file on the same device is a "quick" metadata transaction that does not risk data loss. The renaming can also be done by the input process at the point where file creation is finished. It may be necessary to do this to ensure that a back-end process does not open a file before it is ready for processing.
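The usual shape of that "rename when finished" step, sketched in Python (the ".tmp" suffix is arbitrary; the point is that the temporary name lives in the same directory, so the rename stays on one device):

    import os

    def publish(data, final_path):
        """Write under a temporary name, then rename so readers never see a partial file."""
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())         # make sure the bytes are on disk before publishing
        os.rename(tmp_path, final_path)  # the file appears to consumers in one atomic step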
Depending on the complexity of the activity you may need to resort to some of the tricks used by sendmail or postfix. For example, how do you know whether a data file has been processed: not at all, completely, or incompletely? And if it matters, should it get processed twice?
Do consider the use of mmap(), as it can help you limit the pollution of the page cache by a one-time read activity. With mmap, revisit the NFS topic.
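A sketch of a one-pass mmap() read; the madvise() hints are Linux-specific and only exposed in Python 3.8+, so take them as an assumption rather than a given:

    import mmap

    def process_once(path, handle_chunk, chunk=1 << 20):
        with open(path, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
            try:
                mm.madvise(mmap.MADV_SEQUENTIAL)   # we read front to back, exactly once
                for off in range(0, len(mm), chunk):
                    handle_chunk(mm[off:off + chunk])
                mm.madvise(mmap.MADV_DONTNEED)     # let the kernel drop these pages
            finally:
                mm.close()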
The number of files in any one directory can be important, so establish some design limits; too many files in any one dir can bog down the file system. Also give consideration to the names of the files. Things like sorting of file names can be important, e.g. 2009.28.07 vs. 2009.07.28 vs. 09.7.28 vs. 09.11.11. Also, dots and dashes (".", "-") are shell or regexp metacharacters, so underscores may be a better starting point, e.g. 2009_28_07 vs. 2009_07_28 vs. 09_7_28 vs. 09_11_11. Letters and hex digits may play well too... but design for backup, easy system administration and general simplicity, including documentation (some characters, such as ">", need escaping in HTML documents; see also TeX, LaTeX, SGML, XML).
Predictable file names have been central to some security risks so depending on the data value/ risk issues some attention may need to be given to the creation steps.
On Sun, 2009-03-01 at 11:47 -0800, Nifty Fedora Mitch wrote:
Depending on the complexity of the activity you may need to resort to some of the tricks used by sendmail or postfix. For example, how do you know whether a data file has been processed: not at all, completely, or incompletely? And if it matters, should it get processed twice?
Maildir uses multiple directories and links mail messages between them because hard-linking is NFS-safe. Something along these lines might be suitable in this case.
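A sketch of that hard-link style claim, borrowing Maildir's new/ and cur/ directory names purely for illustration:

    import os

    def claim(name, new_dir="new", cur_dir="cur"):
        """Claim a file by hard-linking it into cur/; exactly one caller can win."""
        src = os.path.join(new_dir, name)
        dst = os.path.join(cur_dir, name)
        try:
            os.link(src, dst)      # atomic, and NFS-safe; fails if someone else linked first
        except FileExistsError:
            return None            # lost the race to another process
        os.unlink(src)             # drop the original name; the file now lives in cur/ only
        return dst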
poc