
That Linux AMD bug in Technicolor detail

Dirty cache, filthy lucre


Letters When we first started to learn about SMP systems, our first thought was - gosh, don't they go on about caches a lot?

But it isn't just in parallel processing where the hairy business of cache coherency is a problem, as we've seen with the AMD Linux bug blame game. This has affected uniprocessor systems, and I was stumped. But Reg readers have provided a wealth of detail, and what follows will take you from a bird's eye view to the low-level nasties.

Jud Leonard provided the best summary:

There's no way for the OS to know when it should do that cache flush. The flush would have to be done after the speculative write, but before the AGP attempted to write the same word. And the software doesn't even think that the code which did the speculative write ever got executed. It was something the processor did to get ready for instructions it thought the program was about to execute, but then the program branched off somewhere else.

I think one can make a case that this is a design error in the Athlon, though it is arguable either way. Similar problems have come up in the Alpha Ev6, and I would assume, most out-of-order processors.

The page size option matters because if you're using 4k pages, the processor doesn't have valid mappings for the pages that are being used by the AGP, so those speculative writes to bad pages don't get performed, and the coherence problem doesn't arise.

Or as John Riddoch summarizes:


My understanding is that the CPU will happily page blocks of 4MB into cache if the pages are set to be cacheable. Unfortunately, this 4MB can include some 4k pages that the GART is using, and isn't cache aware. So, the OS/CPU caches a 4MB block of data which includes one or more 4k blocks used by the GART. Before the CPU/OS pages this back to main RAM, the GART changes one or more of these blocks but this change is merrily overwritten with "stale" data.

If you switch to 4k blocks, the OS will never cache any of these, as any GART data will use up a whole 4k block (at least, the block will be marked as used).

Need more detail? Read on. This note from Lawrence D'Oliveiro from New Zealand details why the 4MB page triggered the problem:

The problem lies in conflicting accesses to a block of memory by both the AGP processor and the CPU. The problem is more likely to occur with a 4MB page size, I assume because the large page size makes it more likely for the CPU's memory mappings to collide with those of the AGP processor.

A simple cache flush doesn't solve the problem, because all a cache flush does is explicitly force synchronization between the cache and main memory (synchronization which will normally happen at some point anyway). Because the memory block was marked for write access when it was loaded into the cache in the first place, this synchronization takes place by doing a write back to memory. Unfortunately, this clobbers data which was already written to the same memory by the AGP processor. Hence the problem.

Though it does seem a bit dumb that an AMD CPU has to write back bits to memory even when they haven't changed...

Richard Urich adds:

With a 4M page, the OS may wind up assigning memory address X to AGP.

However, some data may end at address X-1. This means that if a loop is writing data from X-100 to X-1, the processor will likely mispredict when you are done and accidentally think it is also writing data to address X. It will of course realise its mistake and not write to X, but the data will already be loaded into cache. Then, when the Athlon finally figures out it's not going to write to X, it will write its cached value back to X, leaving you open to problems. The problem occurs when a 4M page is being used by more than one thing, some of which are cacheable and some of which are not.

With 4K pages though, only 1 thing should be using any given page.

As for the flushing, I would think you could invalidate but I'm not sure how easy it would be to tell when you need to, and you would need a pretty big guarantee nothing useful was on that cache line. It's probably better to just move AGP to a page set to non-cacheable.


Erich Boleyn offers even more detail:

When the following 3 conditions occur:

1. memory is marked by a page table mapping as "cacheable".

2. the mapping is actively in the data TLB (i.e. not a TLB miss), and doesn't need to be fetched.

3. an instruction speculatively writes to data in that page (note that this might be an indirect memory reference using a predicted, but incorrect, address! So it could be that the instruction wasn't really intended to write there... I'm not exactly sure what their boundary conditions for issuing a cache-line fetch here are).

...then the Athlon series of processors mark the cacheline being loaded from the bus as dirty, even though it may never get new data written into it.

All dirty cache-lines must be written back at some point. When this happens for a memory region which was supposed to be uncacheable (and doesn't participate in the cache-coherency protocol, like the AGP controller), then it may overwrite something else that was placed into that region, or if it was uncacheable memory representing, say, a memory-mapped I/O area for a device, then who knows what the consequences would be.

The reason the AGP GART mapping in the chipset has such a problem with 4MB page mappings in the CPU is that in the standard usage model for 4MB page mappings, you just map ALL of RAM with them to reduce the number of TLB misses, and then the subset of RAM which gets used by the AGP GART overlaps them.

4KB mappings marked as cacheable would still be a problem, but both:

a) the likelihood of them being in the TLB at the time a speculative instruction comes along that wants to write to that area is small.

and

b) the likelihood of that particular page being currently mapped in the kernel/userspace as cacheable is much smaller.

To my knowledge, no other processor (certainly no Intel x86 processor), even non-x86, has this "feature", and therefore would not have this problem.

Technically, according to the specs for how cacheability and page mappings are described, the OSes/software in question should be written such that any uncacheable/incoherent area is NEVER marked by page mappings as cacheable, but because of the way Intel implemented theirs (and earlier AMD/other vendor x86-compatibles), people were sloppy and got away with it.

Finally, after a similarly exhaustive account of the problem, regular Tom Walsh concludes, "It stretches the limits of my imagination to call this an OS problem."

Yes, but all that trouble with AGP on a single-processor system. What's it going to be like on a two-way?

Thanks to all who wrote in. ®

Related stories

The Linux-AMD AGP bug - who's to blame?
AMD chip bug snares Linux users
