01/29/10

Posted at 02:37:07 pm by jbarnes
Categories: Announcements [A]

Progress

Wow, it’s been a while. Life in the land of Linux graphics has been exciting recently, and there have been a few interesting developments on the Linux PCI front as well.

Linux Graphics Maturing

The Linux graphics stack has really been maturing recently. The Intel and radeon KMS drivers are seeing a lot of bug fixing, and nouveau is getting into shape as well. I think the Intel driver is now in better shape than the userland driver ever was (though that’s not to say it’s without defects; our serious bug count is still way too high for my liking). It supports more hardware and features than ever, including power saving, DisplayPort, newer hardware and advanced rendering APIs, and has been shipping in Linux distros for quite a while now.

We recently finished off the page flipping support, and landed it upstream (it’ll be part of 2.6.33). We also landed a new, core, buffer execution interface (creatively named execbuf2), that allows for more flexibility in the way we submit our command buffers. Specifically, it allows us to control whether a given buffer needs to be mapped with a fence register for operations performed by the commands in its parent execution buffer. This allows our command buffers to be larger, since we won’t exhaust our fence registers prematurely by mapping all objects unconditionally, and allows us to enable tiled texture rendering on pre-965 chips, which can improve performance significantly for some types of rendering.

To support the page flipping work, I had to extend the DRI2 protocol a bit to include support for a SwapBuffers request. While I was at it, I added support for the SGI_video_sync and OML_sync_control extensions, which meant adding support for a few more requests. The SGI_video_sync addition was an important one, since its absence was a regression relative to DRI1. All this new protocol meant new Mesa and X server code, new DRI2 interfaces between the server and DDX drivers, and a bunch of testing and reworking of the interfaces as I figured things out.

All these new features have landed now, and should be part of Linux 2.6.33, Mesa 7.8, X server 1.9 and xf86-video-intel 2.11. See CompositeSwap for an overview of the features and how they’re implemented. With that out of the way, I’ve been able to think more about how compositors and clients should interact, so I came up with CNP. It’s not implemented yet, since I’m still gathering feedback on it, but my hope is that it will help us reduce memory consumption and partial frames in composited environments, as well as address some of the undefined behavior of current GLX calls when drawables are redirected.

Finally, after some discussions with toolkit and compositor developers, I worked with Kristian and Ian to come up with the INTEL_swap_event GLX extension (note it’s definitely possible to implement this on non-Intel as well, but only Intel has support at the moment). This extension allows GLX clients to receive X events when previously queued buffer swaps complete. Clients with mainloops can simply poll their X event queue and do other work if their last swap isn’t done yet, rather than wasting time blocked in the server or queuing another swap and getting too far ahead of the display.
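To make the mainloop pattern concrete, here is a toy Python model. It is not GLX code: `FakeDisplay`, `queue_swap`, `poll_events` and the tick-based swap latency are all hypothetical stand-ins for the X server, an asynchronous swap call and the X event queue. It only shows a client that queues a new swap once the previous one has reported completion and spends the waiting time on other work.

```python
from collections import deque


class FakeDisplay:
    """Hypothetical stand-in for an X server that completes swaps later."""

    def __init__(self, swap_latency_ticks=3):
        self.swap_latency_ticks = swap_latency_ticks
        self.now = 0
        self._pending = deque()   # completion times of queued swaps
        self._events = deque()    # "swap complete" events for the client

    def queue_swap(self):
        # Asynchronous: returns immediately, completion arrives as an event.
        self._pending.append(self.now + self.swap_latency_ticks)

    def tick(self):
        # Advance simulated time and deliver events for finished swaps.
        self.now += 1
        while self._pending and self._pending[0] <= self.now:
            self._pending.popleft()
            self._events.append("swap_complete")

    def poll_events(self):
        # Non-blocking: drain whatever events are queued right now.
        events = list(self._events)
        self._events.clear()
        return events


def run_mainloop(display, ticks):
    """Queue a new swap only after the previous one has completed."""
    swap_in_flight = False
    frames_swapped = 0
    other_work_done = 0
    for _ in range(ticks):
        if "swap_complete" in display.poll_events():
            swap_in_flight = False
        if not swap_in_flight:
            display.queue_swap()      # render and swap the next frame
            swap_in_flight = True
            frames_swapped += 1
        else:
            other_work_done += 1      # use the CPU time for something else
        display.tick()
    return frames_swapped, other_work_done
```

In this model the client never has more than one swap outstanding, so it cannot run ahead of the display, and every iteration that would otherwise have blocked is counted as other work.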

Using it all

One side effect of the new DRI2 code is that glXSwapBuffers calls are now totally asynchronous. Previous versions of DRI1 and DRI2 would either block waiting for vblank, or only return after the blit to implement the swap had completed. With the new code, a DRI2SwapBuffers protocol request ends up in the X server, where it’s scheduled by the DDX driver to occur at some later time (though in some cases it will happen immediately, e.g. if the drawable is offscreen). This leaves more time for clients to do other work while their swap occurs; the INTEL_swap_event extension can help clients take advantage of this extra CPU time.

Some optimizations are present in the new code as well. For instance, if the drawable is the same size as the current root window pixmap and there’s no clipping to worry about, the DDX driver can queue a page flip instead. This saves a tremendous amount of memory bandwidth, and so can really increase performance, especially on high resolution and/or bandwidth starved configurations (e.g. most integrated and embedded graphics platforms). Similarly, if a simple back to front copy is requested for a window and the back and front pixmaps are the same size (i.e. the window manager hasn’t reparented the front window to accommodate decorations and the like), the DDX can simply exchange the backing pixmap object pointers rather than blit. Again, this is important on low memory bandwidth platforms (though note this code is currently disabled due to lack of testing; it’s trivial to enable once I have some test cases).
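The decision described above can be sketched as a small function. This is an illustrative Python model only: `Pixmap`, `choose_swap_method` and the `exchange_enabled` switch are made-up names, and the real DDX code checks more conditions than size and clipping. The exchange path defaults to off here to mirror the note that it is currently disabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pixmap:
    width: int
    height: int


def choose_swap_method(back, front, root, clipped, exchange_enabled=False):
    """Pick the cheapest way to present `back`; names are illustrative only."""
    same_as_root = (back.width, back.height) == (root.width, root.height)
    if same_as_root and not clipped:
        return "flip"        # scan out the back buffer directly, no copy
    same_as_front = (back.width, back.height) == (front.width, front.height)
    if exchange_enabled and same_as_front:
        return "exchange"    # swap the backing pixmap pointers, no copy
    return "blit"            # fall back to copying back to front
```

Only the blit path touches every pixel, which is why the other two matter most on bandwidth starved hardware.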

New hardware

With our Core i7 parts launched, I can talk about some of the hardware feature work we’ve been doing. Zhenyu has been doing most of the bringup and hardware support work for this platform, but I’ve been busy with one of the more interesting hardware features in the Core i7-6xx series, called Intelligent Power Sharing (IPS). Core i7-6xx and 7xx chips are MCPs (multi-chip packages); the CPU and the GPU/MCH are in the same physical processor package, but not on the same die. This means they share a thermal and power design domain. In many cases, only one of the components will be very busy, and thus generating much heat or drawing much power, and it would be a waste to let any extra thermal or power headroom go unused. IPS allows one component to use more than its share of the power or thermal budget so long as the other component is idle enough to allow it. One of the key parts of this technology is so-called “graphics turbo”, in other words the capability of the GPU to exceed its default frequency (and therefore thermal and power budget) when possible. I posted support for this around launch time (latest patch here), and hope to be able to post the full IPS driver soon, since the potential graphics performance upside is fairly large (I’m still collecting measurements, but I’m hoping for something around 15% or maybe even a little higher). The code also allows the GPU to downclock when idle, saving power. The CPU already has its own opportunistic turbo mode which is very effective, but there may be cases where giving it extra power will be helpful (though I’ve yet to find a benchmark that shows it; again, I’m still testing).
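The budget sharing idea can be illustrated with a few lines of Python. This is a toy model under loud assumptions: the function name, the fixed per-component shares and the linear "borrow whatever the other side leaves unused" rule are invented for illustration and say nothing about how the hardware or the IPS driver actually decides.

```python
def turbo_allowance(package_budget_w, own_share_w, other_share_w, other_draw_w):
    """Power (in watts) one component may draw given the other's current draw.

    Toy model: the shares and the linear "unused headroom" rule are
    assumptions for illustration, not how the hardware or driver decides.
    """
    if own_share_w + other_share_w > package_budget_w:
        raise ValueError("shares exceed the package budget")
    # Whatever the other component leaves unused of its share can be borrowed.
    unused_by_other = max(0.0, other_share_w - other_draw_w)
    return own_share_w + unused_by_other
```

With made-up numbers, a GPU whose share is 10 W in a 35 W package could be allowed more while the CPU idles well below its own 25 W share, and would fall back to 10 W once the CPU is busy again.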

PCI

A recent thread highlighted an interesting design choice in Linux. Every platform supporting PCI (indeed pretty much every platform, PCI or no) splits its address space into multiple regions, allowing for memory mapped I/O (MMIO) from the CPU to different devices. Discovering which ranges belong to which devices is done in a number of different ways, from hard coded offsets (as found on many embedded platforms), to firmware descriptor tables (as found in OpenFirmware or ACPI), to physically reading MMIO routing information from CPU host bridges down through the hierarchy.

There’s a drive in Linux to support the last option. After all, Linux is the operating system driving your hardware, so it should do everything itself, right? Well, that’s where we get into trouble. Linux usually runs on platforms designed for Windows (either specifically for Windows or for Windows in addition to Linux). Windows generally uses the second option to make it easier to port to new platforms. For better or for worse (usually the latter), BIOS writers for new platforms generally consider their work done when Windows boots on their new platform and the Windows device manager doesn’t have any dreaded “yellow bangs” next to devices in the device tree. This usually means the ACPI tables used to describe MMIO layout need to be fairly accurate, or Windows may map a device into a location occupied by another or by a host bridge range with decode priority, causing hangs, corruption or the dreaded “yellow bang”.
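Why the accuracy of the range description matters can be shown with a small allocator sketch in Python. Everything here is hypothetical: the function names, the inclusive `(start, end)` tuples and the fixed 4 KiB alignment are simplifications (real PCI BARs are aligned to their size, and the kernel's resource code is far more involved). The point is only that an allocator trusts its list of windows and occupied ranges, so a range missing from that list turns into a conflict.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) address ranges share any address."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_free_range(size, windows, in_use, align=0x1000):
    """Return the first aligned (start, end) range of `size` bytes that lies
    inside one of the bridge `windows` and overlaps nothing in `in_use`."""
    for win_start, win_end in sorted(windows):
        start = (win_start + align - 1) // align * align
        while start + size - 1 <= win_end:
            candidate = (start, start + size - 1)
            clash = next((r for r in in_use if overlaps(candidate, r)), None)
            if clash is None:
                return candidate
            # Skip past the conflicting range and realign.
            start = (clash[1] + 1 + align - 1) // align * align
    return None
```

If the firmware tables (or the host bridge registers) omit a range that something already decodes, it never appears in `in_use`, and the allocator will happily hand that address out to a second device.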

In October of last year, for arguably good reason, we tried to take Linux down the last path. Yinghai Lu added support for reading root bus resource ranges directly from the host bridge on Intel systems. The thought was that we’d be insulated from firmware bugs this way, and have a more accurate view of the system in general. Unfortunately, due to the above, bridge vendors like Intel have no reason to fully document all the decode windows of a given host bridge, which bits might enable or disable decode for a given region, or generally worry about providing the sort of info we’d need to make this approach tenable. So as of now, we’ve removed the supporting code, and are placing a bet that using the same information Windows does (and hopefully in the same way) will give us the same level of portability. We actually tried this back in 2.6.31, I believe, but had to disable it because our resource tracking code couldn’t handle all the resources handed to us by some ACPI firmware implementations. We (well, Bjorn hopefully) should fix that limitation for 2.6.34, and we’ll try again, and hopefully fix quite a few resource mapping related bugs in the process.

10/26/09

Posted at 11:15:47 am by jbarnes

So I followed Paul Mundt into this narrow alley...

Back from Japan at last (I think United lost my sleep schedule on the way home though; trying to retrieve it this weekend has been a challenge).

Both KS and JLS went well I thought. It was really good to connect with some of the Japanese developers that until now I’ve only interacted with through email.

The summit went well this year I thought. We didn’t have a big set of controversial issues to discuss, but we did sort out some development process issues. The highlight for me was the two customer panels. On the first day we had some people from TV and other vendors talk about how they’re using the kernel and other open source software. It’s interesting that some of them are stuck way back on 2.4 and very early 2.6 kernels. Part of the reason is long product development cycles, but mostly it’s because the SoCs used in many products only have support in a limited set of kernels (usually custom patches for specific kernels provided by companies like MontaVista). The “platformization” work done by tglx and the x86 team recently (partly motivated by Intel “Moorestown” support, but also in preparation for more x86 based SoCs in the future) should help with this for x86 stuff. We definitely want to avoid an ARM-like situation where each SoC requires a specific kernel with incompatible firmware and hardware support. I had some good discussions with Linus and Paul on that topic; the tricky part will be ensuring that vendors adhere to some level of standardization in their platform and firmware support. Doing so will have big benefits: upstream kernel support should be better and much more flexible (good for the SoC vendors and their customers), and the platform maintainers should have a much easier job integrating support for new platforms without a huge set of ifdefs and incompatible firmware interfaces. I managed to get a few bugs fixed at KS as well; Ted & Dirk didn’t have anywhere to run when I wanted them to test some patches for problems they’d reported!

The JLS conference was interesting too, with a few good talks on things like barcode delivery of oops info and btrfs.

Tokyo is a pretty amazing city. This was my first trip to Japan and a few of us were fortunate enough to have Paul Mundt guide us for a couple of evenings to explore the city. The narrow alleyways and tiny bars in Shinjuku (at least I think that’s where we ended up) were really fun. We even checked out a Mexican bar called Bonita; Mexican stuff outside the southwest US and Mexico is always interesting, but the Japanese mix made things even more so. Overall a fun night including Japanese Denny’s food, passed out salarymen, and an everything store with some bizarre costumes, including some furry outfits we were tempted to buy… A bit later in the week we had a contrasting experience by going to Seamon (one of the dozens of one star Michelin sushi restaurants in Tokyo) and a high end scotch and cigar bar afterwards.

OK, now back to catching up on the huge backlog of patches that have accrued due to travel neglect.

09/21/09

Posted at 01:51:25 pm by jbarnes

Off to Portland

Heading off to Portland tomorrow for Plumbers (Wed-Fri) and XDC (following Mon & Tue).

I’m hoping we can get the compositing & GLX architectural improvements for X nailed down (see http://dri.freedesktop.org/wiki/CompositeSwap and our talk outline at http://linuxplumbersconf.org/ocw/proposals/70). My implementation of some of those features is already limping along and I’m hoping we can land it soon and start on some compositor improvements to take advantage of the new features.

The merge window has been busy so far, with a few good PCI improvements landed (including VGA arbitration, finally!), and a bunch of gfx related stuff: a slew of power management improvements (dynamic render, display and refresh rate controls, framebuffer compression, RAM self-refresh mode enabling and a bunch of clock gating enables), several stability improvements (GEM memory shrinker, bug fixes for our ring management, reloc range checking, automatic GPU reset support), and a few performance improvements (madvise support for GPU buffers). Overall very busy and very cool stuff. I’m still holding out hope we can land the page flipping and execbuf2 code this cycle; if the merge window stays open through Plumbers that should be possible.

08/28/09

Posted at 04:17:14 pm by jbarnes

KDB+KMS FTW

07/08/09

Posted at 12:57:16 pm by jbarnes

Morning in America^WLinuxland

It’s been about two months since my last update. Most of last month was spent travelling, first to UDS in Barcelona (one of my favorite cities btw), then to London to work at OTC Europe on Moblin, and then on to Oregon to meet with the greater graphics team. In between flights and whenever I had time I also got some good work done; working through PCI patches, improving the error detect/collection code, testing the GPU reset patch a bit more, working on 3D tiling for pre-965 chipsets, and doing general bug fixing. But first things first.

UDS Barcelona

This was the second UDS in Spain I attended (first was in Sevilla a couple of years ago). Overall I found it to be a bit more organized and productive than the first. Probably a sign that the Ubuntu community and Canonical have grown a bit since then…

As I said, I thought this UDS was particularly productive, and ambitious to boot. Scott set an aggressive 10s boot time target for Karmic, and the desktop team decided to pull some very new technologies like KMS and DRI2 into the Karmic release. This is great news for users due to the improved feature set, but also makes life much better for upstream developers, since supporting the new code is much easier than dealing with old DRI1 stuff.

Another huge development (which actually pre-dates this UDS) is the xorg edgers repo. It’s a PPA containing packages of the graphics stack (kernel, libdrm, Mesa, X server and drivers) directly from git. Having this available means testing and development are greatly accelerated; now when users report a bug in the Karmic repos, we can ask them to quickly and easily test the xorg edgers bits to see if their issue has been fixed. If so, we know a backport may be needed, and if not we have a good bug report to feed upstream. I run this repo myself, typically updating every morning, and have found and fixed quite a few bugs as a result of finding them early. Robert and the rest of the edgers team deserve huge thanks from everyone in the Linux community for their work on this repo. I hope their example is followed by other projects, maybe for audio, bluetooth or wireless stacks, which also have large kernel and userland components.

Some other cool stuff got demoed at UDS this time, including some cool Android on Ubuntu work and an Ubuntu spin of Moblin! The latter is especially cool, and I’m hoping to install it on one of my Netbooks soon.

Graphics stuff

The London trip was something of a last minute affair. I was already in Barcelona, and our OTC Europe team in London was in the middle of working on some graphics related Moblin stuff, so they sent Eric, Ian and myself over there to help out. It turned out to be a very productive week; we fixed some major issues while there and overall improved performance by about 50% on some workloads. Some of that work will make its way into our next release and future kernels.

One of the big issues we worked on over there was implementing tiling for 3D textures. We did this a while ago for some of the major buffers (front, back, depth), but doing it for textures is a bit more involved. Eric quickly got 965+ tiling working, but pre-965 turned out to be a bit more of a pain due to its fencing requirements. The 2D and display engines on pre-965 chips need fence registers to cover tiled regions in order to blit or scan them out. The 3D engine can handle tiled surfaces directly, without fences. The current execbuffer code (the central command submission mechanism), however, will always allocate fence registers for tiled surfaces, and on chips with only 8 fence regs (915 and prior) that can be a problem. Not only are fence registers scarce, but mapping and unmapping objects with them can be an expensive operation. So I came up with the execbuffer2 interface, which adds a new relocation type to handle the fence register requirement. Commands using 2D blits, for example, can use a reloc type that indicates a fence register is required, while purely 3D commands can avoid it. There’s potential for more improvement if we remove 2D blits from the DRI driver, though that may involve more overhead than we’d like, due to the higher setup costs. As it stands, tiling textures can give us a ~20% performance boost on some workloads…
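The fence register arithmetic above can be sketched with a toy Python model. It is not the kernel interface: `BatchObject`, `fences_needed` and `batch_fits` are hypothetical names, and the model collapses the relocation type into a single boolean per object. It only compares how many of the 8 fence registers a batch would consume when every tiled object gets one versus when only the objects that a 2D blit touches ask for one.

```python
from dataclasses import dataclass

NUM_FENCE_REGS = 8  # 915 and earlier, per the text above


@dataclass(frozen=True)
class BatchObject:
    name: str
    tiled: bool
    needs_fence: bool = False  # set only when a 2D blit touches the object


def fences_needed(objects, per_object_flag):
    """Count fence registers a batch would consume under each scheme."""
    if per_object_flag:
        # execbuffer2-style: only tiled objects explicitly flagged get a fence.
        return sum(1 for o in objects if o.tiled and o.needs_fence)
    # Original behaviour: every tiled object gets a fence unconditionally.
    return sum(1 for o in objects if o.tiled)


def batch_fits(objects, per_object_flag):
    """True if the batch can be submitted without running out of fences."""
    return fences_needed(objects, per_object_flag) <= NUM_FENCE_REGS
```

A batch with ten tiled textures and one tiled blit target needs eleven fences under the unconditional scheme, which does not fit, but only one when fences are requested per object.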

Another thing I had some time to work on at UDS and in London was GPU error handling. Our GPUs have some error handling capabilities we haven’t really taken advantage of until recently. Eric and Carl recently improved the GPU dumping utility significantly, which really makes GPU hang debugging possible. To make things even easier, I put together a patch to use the GPU error interrupt to trigger an error state capture and generate a uevent. The idea here is to capture the error state from the first error, then tell userspace it should capture a full ring & batch buffer dump as soon as possible. This should allow for automated reporting, à la kerneloops, of GPU related errors. The second part of the work involves automatically resetting the GPU when a hang is detected. I posted a working patch for that aspect of error handling, but it still needs a little work on the hang detection side before we can push it upstream. I’m hoping both that and the execbuffer2 stuff will land in 2.6.32 (since it’s a bit late for big stuff to land in 2.6.31); the reset patch should be fairly easy to backport though.
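The "capture the first error, then notify userspace" idea might be modelled like this in Python. It is a sketch, not the driver's code: `ErrorCapture`, the `notify` callback standing in for a uevent, the dictionary standing in for GPU state, and the re-arm step in `collect` are all assumptions made for illustration.

```python
class ErrorCapture:
    """Toy model: keep the first error state, notify userspace once."""

    def __init__(self, notify):
        self._notify = notify      # callable standing in for a uevent
        self.first_error = None

    def on_error_interrupt(self, gpu_state):
        if self.first_error is not None:
            return False           # later errors don't overwrite the first
        self.first_error = dict(gpu_state)   # snapshot, not a reference
        self._notify("gpu-error")
        return True

    def collect(self):
        """Userspace picked up the dump; re-arm for the next error."""
        state, self.first_error = self.first_error, None
        return state
```

Keeping only the first snapshot matters because follow-on errors after a hang tend to describe the damage rather than the cause.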

Oh yeah, BUGS! The past couple of months have seen unprecedented bug fixing activity on the Intel graphics stack. The removal of the DRI1, XAA, EXA and much of the non-memory manager code is really starting to pay off. We’ve been fixing bugs left and right lately, really stabilizing the drivers in both KMS and non-KMS configurations. Not having to worry as much about breaking some weird configuration possibility is a big help (though the non-DRM case still bites us from time to time). In short, things are really looking good on the graphics front these days; the major architectural changes are complete now, and we can really focus on making things solid.

PCI queue

And on the belated “what’s up with PCI this cycle” front, we had one major issue this time around. In an effort to keep the kernel from using BIOS reserved areas, we started using the ACPI _CRS (current resource settings) data from root bridges at boot time, to describe the set of resources on the root bus. Turns out some BIOSes list a ton of resources in _CRS, including legacy VGA and I/O port space. This can be helpful (since the alternative is having a huge list of chipsets and what ranges they’re hardcoded or programmed to decode), but the PCI layer isn’t quite ready to handle arbitrary numbers of bus resource ranges and types. So early in the cycle Linus had to revert the move to using _CRS by default; that said we did get some good fixes from Gary and Yinghai for the _CRS case, so eventually we should be able to use that data in some form.

Of course, eventually I’m hoping to add something like TJ’s resource management code to Linux. Unfortunately TJ seems to have disappeared, and he never did post his code. Fortunately he did leave a good set of notes on his website about what he’d discovered (e.g. which bits of info are reliable and how other OSes use ACPI data etc). The hard part is finding time to implement such a large change and get it tested well enough to feed upstream. The other perennial topic is VGA arbitration. Tiago recently updated his patchset for that, but I haven’t seen it formally submitted to the linux-pci mailing list yet…
