Weird rawhide desktop behavior

Mon Mar 26 14:27:48 UTC 2012

On Sat, 2012-03-24 at 11:58 -0600, Jonathan Corbet wrote:
> Here's a strange pathology that just bit me for the first time in a while,
> though I've seen it before.  I'm not sure where to file a bug on this
> one...

There are several levels of "X locked up" pathology; let's see if I can
shed some light here.  (For bonus points, someone who wanted to add this
kind of info to the wiki would be Way Cool.)

> In short: I'll be working away, minding my own business, when the desktop
> goes completely dead - no response to any key or mouse events.  That said,
> the X server is still running; the pointer still moves with the mouse.  I
> can also switch to another virtual console with alt-ctrl-Fn.  Sometimes
> things start working again after some time (measured in minutes);
> sometimes I lose patience and start over.  Today I went and made lunch and
> it never came back.

The pointer position (but not image) updates during a SIGIO handler if
you have hardware cursors enabled [1].  How do you know if you have
hardware cursors?  Short answer is, you do, unless you're running a dumb
driver like vesa/fbdev/modesetting.

So, class 1 lockup here is "I can't move the cursor", and boy are you in
trouble.  For KMS drivers this usually means X is waiting on a blocking
DRM ioctl; ps will show X in D state, and /proc/$(pidof Xorg)/wchan will
show you somewhere in ioctl land.  This is always a video driver bug,
and you will typically see something in dmesg when this happens.  Don't
bother trying to get an xserver backtrace here, ptrace can't attach to
D-state processes.
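
A minimal sketch of the class 1 check, in Python, assuming a Linux /proc
and that you already know the server's PID (say from pidof Xorg).  The
helper names are made up for illustration.  The only facts it relies on
are that the third field of /proc/PID/stat is the process state and that
/proc/PID/wchan names where the task is sleeping in the kernel; on some
systems wchan is unreadable or just "0" without privileges.

```python
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split after the last ')' rather than on whitespace.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def describe(pid: int, proc_root: str = "/proc") -> str:
    """Summarize whether a process looks stuck in the kernel (D state)."""
    base = Path(proc_root) / str(pid)
    state = parse_state((base / "stat").read_text())
    try:
        wchan = (base / "wchan").read_text().strip() or "?"
    except OSError:
        wchan = "?"  # wchan may be unreadable without privileges
    if state == "D":
        return f"pid {pid}: D state, waiting in {wchan} (check dmesg)"
    return f"pid {pid}: state {state}, not blocked in uninterruptible sleep"
```

Run it from another VT or over ssh, for example
print(describe(int(sys.argv[1]))) after importing sys.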

Class 2 lockup is "I can move the cursor, but the image never changes",
as in, if you mouse over a text entry field it doesn't change to the
vertical bar, or over a resize grip it doesn't change to a resize
indicator.  Here, the X server is stuck somewhere away from the main
loop, but at least isn't stuck in the kernel.  gdb on X will work, and
will probably tell you where you're stuck.  This class is usually a
userspace bug, could be either the driver or the server.
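
For class 2, one way to grab the backtrace without typing at a gdb
prompt is sketched below.  It assumes gdb is installed, that you have
permission to attach to the server (usually root), and that you run it
from a VT or ssh session rather than from a terminal inside the stuck
X session.  The function names are hypothetical; without debuginfo
installed the output will be much less useful.

```python
import subprocess


def gdb_backtrace_cmd(pid: int) -> list:
    """Build a non-interactive gdb command that dumps all thread backtraces."""
    return [
        "gdb", "-p", str(pid), "-batch",
        "-ex", "thread apply all bt",
    ]


def backtrace(pid: int, timeout: float = 60.0) -> str:
    """Attach to pid, print backtraces, detach; return gdb's output."""
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Batch mode makes gdb exit (and so detach) once the commands have run,
so the server is only stopped for as long as the backtrace takes.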

Class 3 lockup is "I can move the cursor and it behaves normally, but I
can't type".  In this scenario X _is_ successfully going around its main
loop.  If you can VT switch, this is you; VT switch processing happens
while draining the event queue, which is driven off the main loop.  This
scenario has an outside chance of being an xserver bug, but typically
this is the server dutifully doing what clients have told it to do:
something takes a grab, and then deadlocks.  Sorry about X11, we keep
trying to get rid of it for a reason.
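
The three classes above boil down to a short decision procedure.  Here
is a sketch of it in Python; the function and its return strings are
only an illustration of the triage order, and it assumes a driver with
hardware cursors, since the cursor-motion test means nothing otherwise.

```python
def classify_lockup(cursor_moves: bool,
                    cursor_image_changes: bool,
                    keyboard_works: bool) -> str:
    """Map observed symptoms to the lockup classes described above."""
    if not cursor_moves:
        # Server is likely blocked in the kernel: look at ps, wchan, dmesg.
        return "class 1: stuck in the kernel, suspect the video driver"
    if not cursor_image_changes:
        # Server is in userspace but away from its main loop: gdb works.
        return "class 2: stuck in userspace, driver or server"
    if not keyboard_works:
        # Main loop is running: most likely a client took a grab and hung.
        return "class 3: main loop alive, suspect a stuck client grab"
    return "none of the three classes"
```

Checking in this order matters: each later test is only meaningful
once the earlier one has passed.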

Class 3 would be easier to debug if you had some of the debugging key
combos wired up in XKB:

http://cgit.freedesktop.org/xorg/xserver/commit/?id=7d2543a3cb3089241982ce4f8984fd723d5312a1

Sadly GNOME does not yet have UI for this, and I don't remember how to
drive setxkbmap to add them.  Note that the Ungrab and CloseGrab combos
allow you to defeat screensaver locking (i.e., they are security holes),
which is why they're not enabled by default.  You don't want to use them
anyway if you're debugging; you want PrintGrabs, so you can then go
inspect the grabbing process to see why it's deadlocked.

> I've tried killing off applications to see if somebody has some sort of
> all-inclusive grab, but I can't find the right one if that's the case.  I
> can kill something like Firefox and verify that the process is gone, but
> the Firefox window remains on-screen when I return to X.

This is significant.  It means the compositor isn't repainting.  So
either:

a) the compositor isn't the client with the stuck grab, or
b) the compositor's internal grab logic is broken.

[1] - Why position but not image?  Because on most hardware position is
just one register to poke, but image updates require an image upload,
which isn't safe to do if the driver is in the middle of some other
accelerated rendering.  Why only for hardware cursor?  Because software
cursor rendering only caches the pixels behind the cursor on motion,
which means you could race with normal rendering.  Both of these you
could fix if you were willing to take much more of a mutex overhead than
you're probably okay with.

- ajax