Wow, it’s been a while. Life in the land of Linux graphics has been exciting recently, and there have been a few interesting developments on the Linux PCI front as well.
Linux Graphics Maturing
The Linux graphics stack has really been maturing recently. The Intel and radeon KMS drivers are seeing a lot of bug fixing, and nouveau is getting into shape as well. At this point I think the Intel driver is in better shape than the userland driver ever was (though that’s not to say it’s without defects; our serious bug count is still way too high for my liking). It supports more hardware and features than ever, including power saving, DisplayPort, new hardware, and advanced rendering APIs, and has been shipping in Linux distros for quite a while now.
We recently finished off the page flipping support, and landed it upstream (it’ll be part of 2.6.33). We also landed a new, core, buffer execution interface (creatively named execbuf2), that allows for more flexibility in the way we submit our command buffers. Specifically, it allows us to control whether a given buffer needs to be mapped with a fence register for operations performed by the commands in its parent execution buffer. This allows our command buffers to be larger, since we won’t exhaust our fence registers prematurely by mapping all objects unconditionally, and allows us to enable tiled texture rendering on pre-965 chips, which can improve performance significantly for some types of rendering.
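To make the fence register argument concrete, here is a minimal sketch in Python rather than driver code. It is a toy model, not the execbuf2 ioctl itself: the register count, the `ExecObject` class and both helper functions are assumptions made up for illustration, and "fence everything" is simplified to "fence every tiled object".

```python
from dataclasses import dataclass

# Assumption for illustration only; the real count depends on the chip.
NUM_FENCE_REGS = 8

@dataclass
class ExecObject:
    """Hypothetical stand-in for one buffer referenced by a command buffer."""
    name: str
    tiled: bool
    needs_fence: bool  # per-object flag, in the spirit of execbuf2

def fences_required(objects, per_object_flags):
    """Count the fence registers a batch would consume.

    Without per-object flags, every tiled object is fenced unconditionally.
    With them, only objects whose commands actually need a fence are.
    """
    if per_object_flags:
        return sum(1 for o in objects if o.tiled and o.needs_fence)
    return sum(1 for o in objects if o.tiled)

def batch_fits(objects, per_object_flags):
    """True if the batch can run without exhausting the fence registers."""
    return fences_required(objects, per_object_flags) <= NUM_FENCE_REGS

# A batch with 12 tiled buffers, of which only 3 are touched by
# fence-dependent commands.
batch = [ExecObject(f"bo{i}", tiled=True, needs_fence=(i < 3)) for i in range(12)]
print(batch_fits(batch, per_object_flags=False))  # unconditional fencing
print(batch_fits(batch, per_object_flags=True))   # per-object control
```

In this model the same batch is rejected when everything is fenced and accepted once the flag is honored, which is the sense in which command buffers can grow larger.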
To support the page flipping work, I had to extend the DRI2 protocol a bit to include support for a SwapBuffers request. While I was at it, I added support for the SGI_video_sync and OML_sync_control extensions, which meant adding support for a few more requests. The SGI_video_sync addition was an important one, since its absence was a regression relative to DRI1. All this new protocol meant new Mesa and X server code, new DRI2 interfaces between the server and DDX drivers, and a bunch of testing and reworking of the interfaces as I figured things out.
All these new features have landed now, and should be part of Linux 2.6.33, Mesa 7.8, X server 1.9 and xf86-video-intel 2.11. See CompositeSwap for an overview of the features and how they’re implemented. With that out of the way I’ve been able to think more about how compositors and clients should interact, so I came up with CNP. It’s not implemented yet, since I’m still gathering feedback on it, but my hope is that it will help us reduce memory consumption and partial frames in composited environments, as well as address some of the undefined behavior of current GLX calls when drawables are redirected.
Finally, after some discussions with toolkit and compositor developers, I worked with Kristian and Ian to come up with the INTEL_swap_event GLX extension (note it’s definitely possible to implement this on non-Intel as well, but only Intel has support at the moment). This extension allows GLX clients to receive X events when previously queued buffer swaps complete. So instead of making another swap call before the previous one has completed, clients with mainloops can simply poll their X event queue and do other work if their last swap isn’t done yet. That avoids wasting time blocked in the server, or queuing another swap and getting too far ahead of the display.
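The mainloop pattern this enables can be sketched with a simulated server. No real GLX or Xlib calls are made here: `FakeServer`, `run_client`, the event tuple and the tick-based latency are all hypothetical, chosen only to show a client that keeps at most one swap in flight and does other work until the completion event arrives.

```python
from collections import deque

class FakeServer:
    """Hypothetical stand-in for the X server: a queued swap completes
    after a fixed number of ticks and then posts a completion event."""

    def __init__(self, latency_ticks):
        self.latency = latency_ticks
        self.pending = deque()  # [frame, ticks_remaining]
        self.events = deque()   # events visible to the client

    def swap_buffers(self, frame):
        # Asynchronous: just queue the swap and return immediately.
        self.pending.append([frame, self.latency])

    def tick(self):
        for entry in self.pending:
            entry[1] -= 1
        while self.pending and self.pending[0][1] <= 0:
            frame, _ = self.pending.popleft()
            self.events.append(("swap_complete", frame))

def run_client(server, frames, max_ticks=100):
    """Mainloop that never queues a swap while the previous one is pending."""
    log = []
    next_frame = 0
    swap_outstanding = False
    for _ in range(max_ticks):
        # Poll the event queue instead of blocking in the server.
        while server.events:
            kind, frame = server.events.popleft()
            if kind == "swap_complete":
                swap_outstanding = False
                log.append(f"complete {frame}")
        if next_frame == frames and not swap_outstanding:
            break
        if not swap_outstanding:
            server.swap_buffers(next_frame)
            log.append(f"swap {next_frame}")
            swap_outstanding = True
            next_frame += 1
        else:
            log.append("other work")  # CPU time that blocking would waste
        server.tick()
    return log

print(run_client(FakeServer(latency_ticks=2), frames=2))
# ['swap 0', 'other work', 'complete 0', 'swap 1', 'other work', 'complete 1']
```

Gating the next swap on the completion event is what keeps the client from running ahead of the display, while the "other work" slots stand for whatever the application can usefully do in the meantime.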
Using it all
One side effect of the new DRI2 code is that glXSwapBuffers calls are now totally asynchronous. Previous versions of DRI1 and DRI2 would either block waiting for vblank, or only return after the blit to implement the swap had completed. With the new code, a DRI2SwapBuffers protocol request ends up in the X server, where it’s scheduled by the DDX driver to occur at some later time (though in some cases it will happen immediately, e.g. if the drawable is offscreen). This leaves more time for clients to do other work while their swap occurs; the INTEL_swap_event extension can help clients take advantage of this extra CPU time.
Some optimizations are present in the new code as well. For instance, if the drawable is the same size as the current root window pixmap and there’s no clipping to worry about, the DDX driver can queue a page flip instead. This saves a tremendous amount of memory bandwidth, and so can really increase performance, especially on high-resolution and/or bandwidth-starved configurations (e.g. most integrated and embedded graphics platforms). Similarly, if a simple back-to-front copy is requested for a window and the back and front pixmaps are the same size (i.e. the window manager hasn’t reparented the front window to accommodate decorations and the like), the DDX can simply exchange the backing pixmap object pointers rather than blit. Again, this is important on low memory bandwidth platforms (though note this code is currently disabled due to lack of testing; however, it’s trivial to enable once I have some test cases).
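The decision described above can be written down as a small function. This is only a sketch of the logic in Python, not the DDX code: the `Drawable` fields, the function name and the `exchange_enabled` switch (off by default, mirroring the currently disabled exchange path) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Drawable:
    """Hypothetical summary of the state the swap decision depends on."""
    size: tuple        # (width, height) of the drawable
    clipped: bool      # True if any clipping applies
    front_size: tuple  # size of the front pixmap
    back_size: tuple   # size of the back pixmap

def choose_swap_method(drawable, root_size, exchange_enabled=False):
    """Pick the cheapest way to carry out a swap, falling back to a blit."""
    # Full-screen and unclipped: flip the scanout buffer, no copy at all.
    if drawable.size == root_size and not drawable.clipped:
        return "flip"
    # Same-sized front and back pixmaps: swap the pixmap pointers.
    if exchange_enabled and drawable.front_size == drawable.back_size:
        return "exchange"
    # Otherwise copy back to front.
    return "blit"

root = (1920, 1080)
fullscreen = Drawable(root, False, root, root)
window = Drawable((640, 480), False, (640, 480), (640, 480))
decorated = Drawable((640, 480), False, (648, 510), (640, 480))

print(choose_swap_method(fullscreen, root))                    # flip
print(choose_swap_method(window, root))                        # blit
print(choose_swap_method(window, root, exchange_enabled=True)) # exchange
print(choose_swap_method(decorated, root, exchange_enabled=True))  # blit
```

Both fast paths avoid touching pixel data, which is where the memory bandwidth saving comes from.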
With our Core i7 parts launched, I can talk about some of the hardware feature work we’ve been doing. Zhenyu has been doing most of the bringup and hardware support work for this platform, but I’ve been busy with one of the more interesting hardware features in the Core i7-6xx series, called Intelligent Power Sharing (IPS). Core i7-6xx and 7xx chips are MCPs (multi-chip packages); both the CPU and GPU/MCH are in the same physical processor package, but not on the same die. This means they share a thermal and power design domain. In many cases, only one of the components will be very busy, and thus generating much heat or drawing much power, and it would be a waste to let any extra thermal or power headroom go unused. IPS allows one component to use more than its share of power or thermal budget so long as the other component is idle enough to allow it. One of the key parts of this technology is so-called “graphics turbo”, in other words the capability of the GPU to exceed its default frequency (and therefore thermal and power budget) when possible. I posted support for this at around launch time (latest patch here), and hope to be able to post the full IPS driver soon, since the potential graphics performance upside is fairly large (still collecting measurements but I’m hoping for something around 15% or maybe even a little higher). The code also allows the GPU to downclock when idle, saving power. The CPU already has its own opportunistic turbo mode which is very effective, but there may be cases where giving it extra power will be helpful (though I’ve yet to find a benchmark; again, I’m still testing).
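The budget-sharing idea can be shown with a few lines of arithmetic. This is a toy model in Python, not the IPS driver: the function names, the wattages and the frequency table are all invented for illustration and do not describe real hardware.

```python
def allowed_power(own_default_w, other_default_w, other_usage_w):
    """Power one component may draw: its own share plus whatever the
    other component is leaving unused (never less than its own share)."""
    headroom = max(0.0, other_default_w - other_usage_w)
    return own_default_w + headroom

def pick_gpu_frequency(freq_table, budget_w):
    """Highest frequency whose estimated power fits the budget.

    freq_table is a list of (mhz, estimated_watts) pairs; if nothing
    fits, fall back to the lowest frequency in the table.
    """
    fitting = [mhz for mhz, watts in freq_table if watts <= budget_w]
    return max(fitting) if fitting else min(mhz for mhz, _ in freq_table)

# Hypothetical numbers: GPU share 15 W, CPU share 20 W.
GPU_FREQS = [(500, 10.0), (700, 18.0), (900, 28.0)]

# CPU nearly idle at 5 W: the GPU may borrow 15 W and turbo.
print(pick_gpu_frequency(GPU_FREQS, allowed_power(15.0, 20.0, 5.0)))
# CPU using its whole share: the GPU stays at its default budget.
print(pick_gpu_frequency(GPU_FREQS, allowed_power(15.0, 20.0, 20.0)))
```

The same function works in the other direction, with the CPU as the requester and the GPU as the component donating headroom.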
A recent thread highlighted an interesting design choice in Linux. Every platform supporting PCI (indeed pretty much every platform, PCI or not) splits its address space into multiple regions, allowing for memory mapped I/O (MMIO) from the CPU to different devices. Discovering which ranges belong to which devices is done in a number of different ways, from hard coded offsets (as is found on many embedded platforms), to firmware descriptor tables (as found in OpenFirmware or ACPI), to physically reading MMIO routing information from CPU host bridges down through the hierarchy.
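Whichever discovery method is used, the result is a tree of address ranges with owners. As a rough illustration, the Python sketch below parses a listing in the indented `start-end : owner` layout that Linux uses for `/proc/iomem` and finds the most specific owner of an address. The sample listing is made up, and `parse_regions` and `owner_of` are hypothetical helpers, not kernel interfaces.

```python
# Made-up listing in /proc/iomem style; indentation marks child ranges.
SAMPLE = """\
00000000-0009ffff : System RAM
000a0000-000bffff : PCI Bus 0000:00
c0000000-dfffffff : PCI Bus 0000:00
  c0000000-c03fffff : 0000:00:02.0
  d0000000-dfffffff : 0000:00:02.0
"""

def parse_regions(text):
    """Return a list of (depth, start, end, owner) tuples."""
    regions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Two spaces of indentation per nesting level.
        depth = (len(line) - len(line.lstrip(" "))) // 2
        span, owner = line.strip().split(" : ", 1)
        start, end = (int(part, 16) for part in span.split("-"))
        regions.append((depth, start, end, owner))
    return regions

def owner_of(regions, addr):
    """Most deeply nested region containing addr, or None if unmapped."""
    hits = [r for r in regions if r[1] <= addr <= r[2]]
    if not hits:
        return None
    return max(hits, key=lambda r: r[0])[3]

regions = parse_regions(SAMPLE)
print(owner_of(regions, 0x00001000))  # System RAM
print(owner_of(regions, 0xd0001000))  # 0000:00:02.0
print(owner_of(regions, 0xc8000000))  # PCI Bus 0000:00
print(owner_of(regions, 0xf0000000))  # None
```

The third lookup lands in a bridge window that no device has claimed, which is the kind of space that gets handed out to devices and therefore has to be described accurately.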
There’s a drive in Linux to support the last option. After all, Linux is the operating system driving your hardware, it should do everything itself, right? Well, that’s where we get into trouble. Linux usually runs on platforms designed for Windows (either specifically for Windows or for Windows in addition to Linux). Windows generally uses the second option to make it easier to port to new platforms. For better or for worse (usually the latter), BIOS writers for new platforms generally consider their work done when Windows boots on their new platform and the Windows device manager doesn’t have any dreaded “yellow bangs” next to devices in the device tree. This usually means the ACPI tables used to describe MMIO layout need to be fairly accurate, or Windows may map a device into a location occupied by another or by a host bridge range with decode priority, causing hangs, corruption or the dreaded “yellow bang”.
In October of last year, for arguably good reason, we tried to take Linux down the last path. Yinghai Lu added support for reading root bus resource ranges directly from the host bridge on Intel systems. The thought was that we’d be insulated from firmware bugs this way, and have a more accurate view of the system in general. Unfortunately, due to the above, bridge vendors like Intel have no reason to fully document all the decode windows of a given host bridge, which bits might enable or disable decode for a given region, or generally worry about providing the sort of info we’d need to make this approach tenable. So as of now, we’ve removed the supporting code, and are placing a bet that using the same information Windows does (and hopefully in the same way) will give us the same level of portability. We actually tried this back in 2.6.31, I believe, but had to disable it because our resource tracking code couldn’t handle all the resources handed to us by some ACPI firmware implementations. We (well, Bjorn, hopefully) should fix that limitation for 2.6.34, and we’ll try again, and hopefully fix quite a few resource mapping related bugs in the process.