Old message that should have been on the mailing list

-------- Original Message --------
Subject: Re: kernel.spec and scyld patch - for 2.6.32-220.4.1
Date: Tue, 31 Jan 2012 09:52:38 -0500
From: Adam Young <ayoung@redhat.com>
To: John Hawkes <jhawkes@penguincomputing.com>


On 01/31/2012 01:01 AM, John Hawkes wrote:
> Well, I did consider adding the bproc stub into entry_64.S. It just
> seemed easier to get things to work, first, and then fiddle with that
> kind of thing second. Then it would be more obvious that I screwed
> things up. And I hadn't thought out exactly how I wanted to do the
> 'hook' into the bproc module. The entry_64.S stub_bproc would call
> something bproc inside the main kernel, and that called routine would
> likely be the thing that contains the source code hook to the bproc
> module.
>
> I continue to hassle with the startup of a new child thread on the
> slave.  I'm not convinced that the filecache is working right. I'm not
> convinced of much of anything, actually. I'm really frustrated.
> Tomorrow is a new day! I think I'll spend effort to instrument a rhel5
> filecache and see if my rhel6 filecache is doing similar reads.

One way to remove filecache from the mix is to use two virtual
machines, each with a complete install.  Disable file_cache, and when
you start up a new process on a compute node it will just use the
local copy of the libraries.  Don't use any of the Scyld stuff to
provision; just load the kernel module into each side by hand, one
as master and the other as slave, then run bpmaster and bpslave from
the command line.


I am assuming you are developing on CentOS 6.  There should be an
application named virt-manager (you might need to yum install it) that
will let you create a new VM.  Use a CentOS 6 ISO as the install image
and boot the VM.  Once it is up and running and through first boot,
shut it down and use the clone option in virt-manager.  Then you
will have two VMs that are identical except for the MAC address.  It
takes a little bit of CPU to work this way, but for what you are doing,
one CPU per VM should be plenty.

>
> -- John
>
>
> On Jan 30, 2012, at 5:59 PM, Adam Young <ayoung@redhat.com> wrote:
>
>> On 01/30/2012 04:54 PM, John Hawkes wrote:
>>> I believe you're looking at the RHEL 4&5 bproc stub, not the RHEL 6, which is:
>>>
>>> #elif RHEL_MAJOR == 6
>>> __asm__(
>>> "       .text                      \n"
>>> "       .globl do_bproc_stub       \n"
>>> "do_bproc_stub:                    \n"
>>> "       sub     $0x30,%rsp         \n"
>>> "       callq   save_rest          \n"
>>> "       leaq    0x8(%rsp),%rcx     \n"
>>> "       callq   do_bproc           \n"
>>> "       jmp     ptregscall_common  \n"
>>> );
>>>
>>>
>>> The disassembly of stub_sigaltstack (which sets up for 2 args, plus
>>> ptregs, vs. bproc which wants 3 args + ptregs):
>>>      sub    $0x30,%rsp
>>>      callq  ffffffff8100af70 <save_rest>
>>>      lea    0x8(%rsp),%rdx
>>>      callq  ffffffff8100ab20 <sys_sigaltstack>
>>>      jmpq   ffffffff8100b4a0 <ptregscall_common>
>>>      nopl   0x0(%rax,%rax,1)
>>>
>>> So I was presuming that if 2 args uses %rdx, then 3 args uses %rcx.
>> Yes, I was.  That looks a lot more correct.  The PTREGSCALL macro takes
>> the register that will hold the pt_regs pointer as its arg parameter, and
>> uses it in:
>>
>>     leaq 8(%rsp), \arg    /* pt_regs pointer */
>>
>> do_fork is defined like this:
>>
>>
>> And     PTREGSCALL stub_fork, sys_fork, %rdi
>>
>> The order of registers used in function calls should be:
>>
>> RDI, RSI, RDX, RCX, R8 and R9
>>
>> And
>>
>> include/asm/syscalls.h:24:int sys_fork(struct pt_regs *);
>>
>> clone also seems to line up.
>>
>> PTREGSCALL stub_clone, sys_clone, %r8
>> long
>> sys_clone(unsigned long clone_flags, unsigned long newsp,
>>       void __user *parent_tid, void __user *child_tid, struct pt_regs *regs)
>>
>>
>> I'd almost say that you should move the code that does this kind of
>> magic into entry_64.S and such, so you can make use of the real kernel
>> macros.  Copying it is problematic and won't help it get into mainline
>> anyway, and there is no reason not to do it the right way now, is there?
>>
>>
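The register choices above can be sanity-checked with a small sketch. It
assumes the System V AMD64 convention (integer arguments in RDI, RSI, RDX,
RCX, R8, R9) and that pt_regs is passed as the last argument.
ptregs_arg_reg() is a hypothetical helper for illustration only, not
anything in the kernel or in bproc.

```c
#include <stddef.h>

/* Integer/pointer argument registers, in System V AMD64 order. */
static const char *const arg_regs[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

/*
 * Hypothetical helper: name of the register that carries the trailing
 * struct pt_regs * when the handler takes n_args ordinary arguments
 * before it.  Returns NULL when no argument register is left.
 */
const char *ptregs_arg_reg(size_t n_args)
{
    size_t n_regs = sizeof(arg_regs) / sizeof(arg_regs[0]);

    return n_args < n_regs ? arg_regs[n_args] : NULL;
}
```

With this mapping, sys_fork (no leading arguments) gets %rdi,
sys_sigaltstack (two) gets %rdx, sys_clone (four) gets %r8, and a
three-argument do_bproc gets %rcx, which matches the RHEL 6 stub quoted
above.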
>>
>>> John
>>>
>>> On Mon, Jan 30, 2012 at 12:42 PM, Adam Young <ayoung@redhat.com> wrote:
>>>> According to the top-of-tree kernel, the ptregs call (in entry_64.S, not
>>>> entry.S anymore) is:
>>>>
>>>>     PARTIAL_FRAME 1 8        /* offset 8: return address */
>>>>     subq $REST_SKIP, %rsp
>>>>     CFI_ADJUST_CFA_OFFSET REST_SKIP
>>>>     call save_rest
>>>>     DEFAULT_FRAME 0 8        /* offset 8: return address */
>>>>     leaq 8(%rsp), \arg    /* pt_regs pointer */
>>>>     call \func
>>>>     jmp ptregscall_common
>>>>     CFI_ENDPROC
>>>>
>>>>
>>>> Which leads me to think that sys_bproc should have something done before the
>>>> call.
>>>>
>>>>
>>>> You have:
>>>>         "       leaq    do_bproc(%rip),%rax         \n"
>>>>         "       leaq    -"__stringify(ARGOFFSET)"+8(%rsp), %rcx    \n"
>>>>         "       jmp     ptregscall_common           \
>>>>
>>>>
>>>> But I'd expect you need something more like:
>>>>
>>>>     PARTIAL_FRAME 1 8        /* offset 8: return address */
>>>>     subq $REST_SKIP, %rsp
>>>>     CFI_ADJUST_CFA_OFFSET REST_SKIP
>>>>     call save_rest
>>>>     DEFAULT_FRAME 0 8        /* offset 8: return address */
>>>>     leaq 8(%rsp), \arg    /* pt_regs pointer */
>>>>     call \do_bproc(%rip),%rax
>>>>     jmp ptregscall_common
>>>>     CFI_ENDPROC
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 01/27/2012 03:47 PM, John Hawkes wrote:
>>>>> I'm thinking that my problem is that I'm not handling the pt_regs (et
>>>>> al) correctly, so when the bproc kernel module code returns to user
>>>>> state, things aren't right.  I'm attaching the kernel/sysdep_x86_64.c
>>>>> that I'm currently using, which has both the RHEL 4&5 code and the new
>>>>> RHEL 6 code (managed by #defines).  You can see I have separate
>>>>> versions of bproc_kernel_thread, etc.
>>>>>
>>>>> John
>>>>>
>>>>> On Tue, Jan 24, 2012 at 11:56 AM, Adam Young <ayoung@redhat.com> wrote:
>>>>>> On 01/24/2012 02:25 PM, John Hawkes wrote:
>>>>>>> I'm attaching a tarball of rpms that should get you started.  The
>>>>>>> bproc is verbose with printk and syslog debugging messages.
>>>>>>> The md5sum for the tarball:
>>>>>>>     842e760a165d59a144425043e09bd888
>>>>>>>
>>>>>>> john
>>>>>> Thanks.  I actually started a git repo based on the Linus tree and
>>>>>> started applying it.  But I will take this and run with it as well.  I
>>>>>> tried to build the kernel in a VM, but ran out of disk space, so I'll
>>>>>> need to expand it later.
>>>>>>
>>>>>> I talked quickly with Kyle earlier today.  He said that the biggest
>>>>>> difference between the RHEL 5 and RHEL 6 kernels was the removal of
>>>>>> pt_regs.  This is probably no surprise to you, but it was to me.  This
>>>>>> commit is probably the best explanation for it, or at least a good
>>>>>> starting point.  It is old enough that I can't find a pointer to it on
>>>>>> gitweb, so I included the commit message in toto.
>>>>>>
>>>>>> commit 7d12e780e003f93433d49ce78cfedf4b4c52adc5
>>>>>> Author: David Howells <dhowells@redhat.com>
>>>>>> Date:   Thu Oct 5 14:55:46 2006 +0100
>>>>>>
>>>>>>
>>>>>> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git&a=search&h=HEAD&st=commit&s=7d12e780e003f93433d49ce78cfedf4b4c52adc5
>>>>>>
>>>>>>
>>>>>>     IRQ: Maintain regs pointer globally rather than passing to IRQ handlers
>>>>>>
>>>>>>     Maintain a per-CPU global "struct pt_regs *" variable which can be used instead
>>>>>>     of passing regs around manually through all ~1800 interrupt handlers in the
>>>>>>     Linux kernel.
>>>>>>
>>>>>>     The regs pointer is used in few places, but it potentially costs both stack
>>>>>>     space and code to pass it around.  On the FRV arch, removing the regs parameter
>>>>>>     from all the genirq function results in a 20% speed up of the IRQ exit path
>>>>>>     (ie: from leaving timer_interrupt() to leaving do_IRQ()).
>>>>>>
>>>>>>     Where appropriate, an arch may override the generic storage facility and do
>>>>>>     something different with the variable.  On FRV, for instance, the address is
>>>>>>     maintained in GR28 at all times inside the kernel as part of general exception
>>>>>>     handling.
>>>>>>
>>>>>>     Having looked over the code, it appears that the parameter may be handed down
>>>>>>     through up to twenty or so layers of functions.  Consider a USB character
>>>>>>     device attached to a USB hub, attached to a USB controller that posts its
>>>>>>     interrupts through a cascaded auxiliary interrupt controller.  A character
>>>>>>     device driver may want to pass regs to the sysrq handler through the input
>>>>>>     layer which adds another few layers of parameter passing.
>>>>>>         PARTIAL_FRAME 1 8               /* offset 8: return address */
>>>>>>         subq $REST_SKIP, %rsp
>>>>>>         CFI_ADJUST_CFA_OFFSET REST_SKIP
>>>>>>         call save_rest
>>>>>>         DEFAULT_FRAME 0 8               /* offset 8: return address */
>>>>>>         leaq 8(%rsp), \arg      /* pt_regs pointer */
>>>>>>         call \func
>>>>>>         jmp ptregscall_common
>>>>>>         CFI_ENDPROC
>>>>>>
>>>>>>     I've build this code with allyesconfig for x86_64 and i386.  I've runtested the
>>>>>>     main part of the code on FRV and i386, though I can't test most of the drivers.
>>>>>>     I've also done partial conversion for powerpc and MIPS - these at least compile
>>>>>>     with minimal configurations.
>>>>>>
>>>>>>     This will affect all archs.  Mostly the changes should be relatively easy.
>>>>>>     Take do_IRQ(), store the regs pointer at the beginning, saving the old one:
>>>>>>
>>>>>>         struct pt_regs *old_regs = set_irq_regs(regs);
>>>>>>
>>>>>>     And put the old one back at the end:
>>>>>>
>>>>>>         set_irq_regs(old_regs);
>>>>>>
>>>>>>     Don't pass regs through to generic_handle_irq() or __do_IRQ().
>>>>>>
>>>>>>     In timer_interrupt(), this sort of change will be necessary:
>>>>>>
>>>>>>         -       update_process_times(user_mode(regs));
>>>>>>         -       profile_tick(CPU_PROFILING, regs);
>>>>>>         +       update_process_times(user_mode(get_irq_regs()));
>>>>>>         +       profile_tick(CPU_PROFILING);
>>>>>>
>>>>>>     I'd like to move update_process_times()'s use of get_irq_regs() into itself,
>>>>>>     except that i386, alone of the archs, uses something other than user_mode().
>>>>>>
>>>>>>     Some notes on the interrupt handling in the drivers:
>>>>>>
>>>>>>      (*) input_dev() is now gone entirely.  The regs pointer is no longer stored in
>>>>>>          the input_dev struct.
>>>>>>
>>>>>>      (*) finish_unlinks() in drivers/usb/host/ohci-q.c needs checking.  It does
>>>>>>          something different depending on whether it's been supplied with a regs
>>>>>>          pointer or not.
>>>>>>
>>>>>>      (*) Various IRQ handler function pointers have been moved to type
>>>>>>          irq_handler_t.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
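The save/restore pattern that the commit message describes for do_IRQ()
can be sketched in user space. This is a simplified illustration, not
kernel code: it assumes a single global pointer where the kernel keeps one
per CPU, uses a stand-in struct pt_regs, and the demo_* names are
hypothetical.

```c
#include <stddef.h>

/* Stand-in for the real structure; NOT the kernel layout. */
struct pt_regs {
    unsigned long ip;
};

/* Assumption: one global pointer.  The kernel keeps one per CPU. */
static struct pt_regs *irq_regs;

struct pt_regs *get_irq_regs(void)
{
    return irq_regs;
}

/* Install new_regs as the current pointer and return the previous one. */
struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
{
    struct pt_regs *old_regs = irq_regs;

    irq_regs = new_regs;
    return old_regs;
}

static unsigned long last_ip;

/* Hypothetical handler: fetches regs from the global, not a parameter. */
static void demo_handler(void)
{
    last_ip = get_irq_regs()->ip;
}

/* Hypothetical stand-in for do_IRQ(): save, dispatch, restore. */
void demo_do_irq(struct pt_regs *regs)
{
    struct pt_regs *old_regs = set_irq_regs(regs);

    demo_handler();
    set_irq_regs(old_regs);
}

unsigned long demo_last_ip(void)
{
    return last_ip;
}
```

Because the previous pointer is restored on exit, a nested interrupt would
leave the outer handler's regs intact once it returns, and handlers that
never look at regs no longer have to carry the parameter.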