Hi,
I'm reposting this to bring attention to these patches again. The
complete set of these patches has survive a nunc-stans stress test, and
the 24hour connection reliability test in directory server.
https://fedorahosted.org/nunc-stans/ticket/56https://fedorahosted.org/nunc-stans/attachment/ticket/56/0002-Ticket-56-und…
On fedora rawhide this causes the build to fail. We aren't changing the
malloc impl, and we are allowing the over-ride to malloc via the thread
pool malloc callbacks, so this check is not so important.
https://fedorahosted.org/nunc-stans/ticket/51https://fedorahosted.org/nunc-stans/attachment/ticket/51/0001-Ticket-51-Job…
When a job is marked persistent, this means that after the event has
occurred, and we dispatch work, the job itself stays armed. Armed
meaning "accepting new work". In this case it is still registered in the
event framework, and will "just work" to accept new work.
However, if you rearm this job, it may be placed into the event_q again,
and the event framework may or may not accept the insertion of the
duplicated fd for events. At best, the event framework ignores it. At
moderate, we trigger two events for one input. At worst, we explode.
We should ignore rearm on persistent jobs as it is not needed! they stay
armed.
https://fedorahosted.org/nunc-stans/ticket/52https://fedorahosted.org/nunc-stans/attachment/ticket/52/0004-Ticket-52-ns_…
When we call ns_job_modify, IE to change a READ job to a WRITE job, it
automatically rearms the job. Given that a job may have other
alterations performed before or after the ns_job_modify, this causes
issues. When the job is armed, it's moved to the queue through a set of
lock free steps. Once queued, if you change the job, there is NO
GUARANTEE about the atomicity of the change you make, whether it will
see another thread or not. Worst, it will only partially be seen.
It's certainly possible to cause an issue where a job_modify is run,
then another thread dequeues the work, and begins to work on it, while
the original thread is still changing it! Subtle race conditions.
This behaviour is surprising. The function should do one thing, and one
thing well: modifying the type.
https://fedorahosted.org/nunc-stans/ticket/57https://fedorahosted.org/nunc-stans/attachment/ticket/57/0001-Ticket-57-Imp…
Given the discussed issues above, to prevent and detect issues we must
enforce that jobs are in certain states so that we can act upon them
with reasonable behaviours.
A job is created as WAITING. In the WAITING state, the job is NOT in a
queue, and can be modified freely.
When the job is rearmed, it moves to ARMED. If you attempt to modify the
job, you will be denied. Only when the job is triggered IE dequeued, it
moves to WAITING and the callback trigged. The job may now be modified
and rearmed.
A job that is ARMED cannot be re-armed.
A job that is persistent, never leaves the ARMED state, IE cannot be
modified.
A job that is ARMED cannot be deleted unless the server is shutting
down. Imagine if you free-ed a job as it was being dequeued and callback
triggering. Segfault risk.
A job that is WAITING can me sent to done, where it will be marked as
NEEDS_DELETE. After wards, it will be moved to DELETED and freed.
This allows us to prevent double frees (marking a job done twice), use
after free (arming a done job), race conditions (modifying armed jobs),
and many more. It gives us a solid base to assert and reason about the
correctness of jobs.
In the future, I will need to find a way to make it so that ARMED jobs
*can* be deleted, so that we can delete persistent jobs during run time.
The issue here is the way that lfds works with regards to barries of the
memory, so I will need to study this further.
I hope this makes the review process clearer, and the intent of these
changes better.
--
Sincerely,
William Brown
Software Engineer
Red Hat, Brisbane