Performance Monitoring Events—AMD Athlon™ 64 and AMD Opteron™ Processors

This section describes processor performance monitor events available for performance analysis and tuning for AMD Athlon™ 64 and AMD Opteron™ processors. AMD Athlon 64 and AMD Opteron processors provide four 48-bit performance counters per available core, which allows four types of events to be monitored simultaneously. The performance counters are not guaranteed to be fully accurate and should be used as a relative measure of performance to assist in application tuning. Unlisted event numbers are reserved and their results are undefined.

Conventions, Definitions, and Special Notes

The Event Select value is used to select the event to be monitored. The Unit Mask is used to further qualify the event selected by the Event Select value. The Mask Value given here is a bit index and corresponds to the actual 8-bit Unit Mask as specified in the following table.

Mask value Unit Mask
0 0x01
1 0x02
2 0x04
3 0x08
4 0x10
5 0x20
6 0x40
7 0x80

Unless otherwise stated, the Unit Mask values shown may be combined to select any desired combination of the sub-events for a given event. For events where no Unit Mask table is shown, the Unit Mask is not applicable.
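
The index-to-mask mapping in the table above is a single-bit shift, and combined sub-events are the bitwise OR of the individual masks. A minimal sketch follows; the function name unit_mask is hypothetical and chosen only for illustration.

```python
def unit_mask(*mask_values: int) -> int:
    """Combine Mask Value indices (0-7) into one 8-bit Unit Mask."""
    mask = 0
    for value in mask_values:
        if not 0 <= value <= 7:
            raise ValueError(f"mask value out of range: {value}")
        mask |= 1 << value  # index n selects bit n, e.g. 3 -> 0x08
    return mask


# Event 0x000 with the add-pipe and multiply-pipe sub-events (values 0 and 1).
print(hex(unit_mask(0, 1)))  # 0x3
```

Calling unit_mask with no arguments yields 0, which suits events for which the Unit Mask is not applicable.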

Speculative vs. Retired events: Several events may include speculative activity, meaning they may be associated with false-path instructions that are ultimately discarded due to a branch misprediction. Events associated with Retire reflect actual program execution. For events where the distinction may matter, these are explicitly labeled as one or the other.

Dual-core operation: In AMD64 dual-core processors, each core has its own set of event counters. However, the two cores share the event-select logic for events in the Northbridge, so one core can overwrite a Northbridge event select (including unit mask) that was previously set up by the other core, changing the event that the first core believes it is counting.

Note: This conflict between cores occurs between corresponding event counters, e.g., PMC0 vs. PMC0. So both cores cannot simultaneously monitor different Northbridge events using the same counter. When using the performance counters simultaneously in both cores, care must be taken to avoid this conflict, such as by having one core monitor the desired Northbridge events and the other core either monitor events internal to itself, or not use the corresponding event counters.
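
The counter-pairing rule in the note above can be checked before the counters are programmed. The following is a minimal sketch rather than a definitive tool: the function name, the data layout, and the caller-supplied set of Northbridge event selects are all assumptions made for illustration. The example treats the memory controller events E0h and E3h as Northbridge events, which is also an assumption here.

```python
def northbridge_conflicts(core0, core1, nb_event_selects):
    """Return counter indices on which both cores select different Northbridge events.

    core0 and core1 map a counter index (0-3) to an (event_select, unit_mask)
    tuple. nb_event_selects is a caller-supplied set of event select values
    counted in the shared Northbridge logic (an assumption of this sketch).
    """
    conflicts = []
    for counter in sorted(set(core0) & set(core1)):
        setup0, setup1 = core0[counter], core1[counter]
        both_nb = setup0[0] in nb_event_selects and setup1[0] in nb_event_selects
        if both_nb and setup0 != setup1:
            conflicts.append(counter)
    return conflicts


# Illustrative setup: both cores use PMC0 for different Northbridge events.
NB_EVENTS = {0xE0, 0xE3}  # assumed set, supplied by the caller
core0 = {0: (0xE0, 0x07), 1: (0xC0, 0x00)}
core1 = {0: (0xE3, 0x01), 1: (0x76, 0x00)}
print(northbridge_conflicts(core0, core1, NB_EVENTS))  # [0]
```

This check covers only the corresponding-counter conflict. It does not model the case where a write for a non-Northbridge event also disturbs the other core's Northbridge selection.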

Rev E erratum regarding dual-core processor operation: Rev E dual-core processors have an erratum whereby any write to an event select MSR, regardless of what event is being selected, overwrites the Northbridge event selects for that counter. Hence the conflict described above exists even when the second core is being programmed for non-Northbridge events. This will be fixed in a future revision. The work-around is to have the Northbridge-monitoring core program its event counters after the other core has completed its own event counter setup.

For detailed information, refer to the BIOS and Kernel Developer's Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 26094.

Floating point events

Event Select 0x000 Dispatched FPU operations

Abbreviation: FPU ops

The number of operations (uops) dispatched to the FPU execution pipelines. This event reflects how busy the FPU pipelines are. This includes all operations done by x87, MMX® and SSE instructions, including moves. Each increment represents a one-cycle dispatch event; packed 128-bit SSE operations count as two ops; scalar operations count as one. Speculative. (See also event CBh).

Note: Since this event includes non-numeric operations, it is not suitable for measuring MFLOPS.

Value Unit mask description
0 Add pipe ops
1 Multiply pipe ops
2 Store pipe ops
3 Add pipe load ops
4 Multiply pipe load ops
5 Store pipe load ops

Event Select 0x001 Cycles with no FPU ops retired

Abbreviation: No FPU op cycles

The number of cycles in which no FPU operations were retired. Invert this (set the Invert control bit in the event select MSR) to count cycles in which at least one FPU operation was retired.
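
As an illustration of how an event select value with the Invert bit might be composed, the sketch below packs the fields into one integer. The bit layout used (event select in bits 7:0, unit mask in bits 15:8, USR bit 16, OS bit 17, Edge bit 18, Enable bit 22, Invert bit 23, counter mask in bits 31:24) is an assumption that is not stated in this section; verify it against the BKDG (order# 26094) before use. The function name is hypothetical.

```python
def encode_event_select(event, unit_mask=0, usr=True, os=True,
                        edge=False, invert=False, counter_mask=0, enable=True):
    """Pack event select fields into one value (assumed bit layout, see text)."""
    if not 0 <= event <= 0xFF:
        raise ValueError("event select must fit in 8 bits")
    if not 0 <= unit_mask <= 0xFF:
        raise ValueError("unit mask must fit in 8 bits")
    if not 0 <= counter_mask <= 0xFF:
        raise ValueError("counter mask must fit in 8 bits")
    value = event | (unit_mask << 8)
    value |= int(usr) << 16      # count in user mode
    value |= int(os) << 17       # count in OS mode
    value |= int(edge) << 18     # edge detect
    value |= int(enable) << 22   # counter enable
    value |= int(invert) << 23   # Invert control bit
    value |= counter_mask << 24
    return value


# Event 0x001 with the Invert bit set, counting in both user and OS mode.
print(hex(encode_event_select(0x01, invert=True)))  # 0xc30001
```

The sketch only builds the value. Writing it to the MSR, and any counter mask setting the Invert bit may require, depend on the platform and are outside its scope.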

Event Select 0x002 Dispatched fast flag FPU operations

Abbreviation: Fast flag FPU ops

The number of FPU operations that use the fast flag interface (e.g. FCOMI, COMISS, COMISD, UCOMISS, UCOMISD). Speculative.

Load/store and TLB events

Event Select 0x020 Segment register loads

Abbreviation: Seg reg loads

The number of segment register loads performed.

Value Unit mask description
0 ES
1 CS
2 SS
3 DS
4 FS
5 GS
6 HS

Event Select 0x021 Pipeline restart due to self-modifying code

Abbreviation: Restart self-mod code

The number of pipeline restarts that were caused by self-modifying code (a store that hits any instruction that's been fetched for execution beyond the instruction doing the store).

Event Select 0x022 Pipeline restart due to probe hit

Abbreviation: Restart probe hit

The number of pipeline restarts caused by an invalidating probe hitting on a speculative out-of-order load.

Event Select 0x023 LS buffer 2 full

Abbreviation: LS2 buffer full

The number of cycles that the LS2 buffer is full. This buffer holds stores waiting to retire as well as requests that missed the data cache and are waiting on a refill. This condition will stall further data cache accesses, although such stalls may be overlapped by independent instruction execution.

Event Select 0x024 Locked operations

Abbreviation: Locked ops

This event covers locked operations performed and their execution time. The execution time represented by the cycle counts is typically overlapped to a large extent with other instructions. The non-speculative cycles event is suitable for event-based profiling of lock operations that tend to miss in the cache.

Value Unit mask description
0 Number of locked instructions executed
1 Number of cycles spent in speculative phase
2 Number of cycles spent in non-speculative phase

Event Select 0x065 Memory requests by type

Abbreviation: Memory req

These events reflect accesses to uncacheable (UC) or write-combining (WC) memory regions (as defined by MTRR or PAT settings) and Streaming Store activity to WB memory. Both the WC and Streaming Store events reflect Write Combining buffer flushes, not individual store instructions. WC buffer flushes typically consist of one 64-byte write to the system for each flush (assuming software typically fills a buffer before it gets flushed). A partially-filled buffer will require two or more smaller writes to the system. The WC event reflects flushes of WC buffers that were filled by stores to WC memory or streaming stores to WB memory. The Streaming Store event reflects only flushes due to streaming stores (which are typically only to WB memory). The difference between counts of these two events reflects the true amount of write events to WC memory.

Value Unit mask description
0 Requests to non-cacheable (UC) memory
1 Requests to write-combining (WC) memory
7 Streaming store (SS) requests
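
The subtraction described above (WC count minus Streaming Store count) can be written out as follows. This is a minimal sketch with a hypothetical function name; it clamps at zero because the two counts are usually sampled separately and the counters are not guaranteed to be exact.

```python
def wc_memory_flushes(wc_count: int, streaming_store_count: int) -> int:
    """Estimate buffer flushes caused by stores to WC memory alone.

    wc_count: event 0x065 with Unit Mask value 1 (WC requests).
    streaming_store_count: event 0x065 with Unit Mask value 7 (SS requests).
    """
    return max(wc_count - streaming_store_count, 0)


print(wc_memory_flushes(1000, 300))  # 700
```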

Data cache events

Event Select 0x040 Data cache accesses

Abbreviation: DC accesses

The number of accesses to the data cache for load and store references. This may include certain microcode scratchpad accesses, although these are generally rare. Each increment represents an eight-byte access, although the instruction may only be accessing a portion of that. Speculative.

Event Select 0x041 Data cache misses

Abbreviation: DC misses

The number of data cache references which missed in the data cache. Speculative.

Except in the case of streaming stores, only the first miss for a given line is included; access attempts by other instructions while the refill is still pending are not included in this event. So in the absence of streaming stores, each event reflects one 64-byte cache line refill, and counts of this event are the same as, or very close to, the combined count for event 42h.

Streaming stores, however, will cause this event for every such store, since the target memory is not refilled into the cache. Hence this event should not be used as an indication of data cache refill activity; event 42h should be used for such measurements. (See event 65h for an indication of streaming store activity.) A large difference between events 41h and 42h (with all UNIT_MASK bits set for event 42h) would be due mainly to streaming store activity.

Event Select 0x042 Data cache refills from L2 or system

Abbreviation: DC refills L2/sys

The number of data cache refills satisfied from the L2 cache (and/or the system), per the UNIT_MASK. UNIT_MASK bits 4:1 allow a breakdown of refills from the L2 by coherency state. UNIT_MASK bit 0 reflects refills which missed in the L2, and provides the same measure as the combined sub-events of event 43h. Each increment reflects a 64-byte transfer. Speculative.

Value Unit mask description
0 Refill from System
1 Shared-state line from L2
2 Exclusive-state line from L2
3 Owned-state line from L2
4 Modified-state line from L2

Event Select 0x043 Data cache refills from system

Abbreviation: DC refills sys

The number of L1 cache refills satisfied from the system (system memory or another cache), as opposed to the L2. The UNIT_MASK selects lines in one or more specific coherency states. Each increment reflects a 64-byte transfer. Speculative.

Value Unit mask description
0 Invalid
1 Shared
2 Exclusive
3 Owned
4 Modified

Event Select 0x044 Data cache lines evicted

Abbreviation: DC evicted

The number of L1 data cache lines written to the L2 cache or system memory, having been displaced by L1 refills. The UNIT_MASK may be used to count only victims in specific coherency states. Each increment represents a 64-byte transfer. Speculative.

In most cases, L1 victims are moved to the L2 cache, displacing an older cache line there. Lines brought into the data cache by PrefetchNTA instructions, however, are evicted directly to system memory (if dirty) or invalidated (if clean). There is no provision for measuring this component by itself. The Invalid case (UNIT_MASK value 01h) reflects the replacement of lines that would have been invalidated by probes for write operations from another processor or DMA activity.

Value Unit mask description
0 Invalid
1 Shared
2 Exclusive
3 Owned
4 Modified

Event Select 0x045 L1 DTLB miss and L2 DTLB hit

Abbreviation: DTLB L1M L2H

The number of data cache accesses that miss in the L1 DTLB and hit in the L2 DTLB. Speculative.

Event Select 0x046 L1 DTLB miss and L2 DTLB miss

Abbreviation: DTLB L1M L2M

The number of data cache accesses that miss in both the L1 and L2 DTLBs. Speculative.

Event Select 0x047 Misaligned accesses

Abbreviation: Misalign access

The number of data cache accesses that are misaligned. These are accesses which cross an eight-byte boundary. They incur an extra cache access (reflected in event 40h), and an extra cycle of latency on reads. Speculative.

Event Select 0x048 Microarchitectural late cancel of an access

Abbreviation: Late cancel

Event Select 0x049 Microarchitectural early cancel of an access

Abbreviation: Early cancel

Event Select 0x04A Single-bit ECC errors recorded by scrubber

Abbreviation: 1-bit ECC errors

The number of single-bit errors corrected by either of the error detection/correction mechanisms in the data cache.

Value Unit mask description
0 Scrubber error
1 Piggyback scrubber errors

Event Select 0x04B Prefetch instructions dispatched

Abbreviation: Prefetch inst

The number of prefetch instructions dispatched by the decoder. Speculative. Such instructions may or may not cause a cache line transfer. Any Dcache and L2 accesses, hits and misses by prefetch instructions are included in these types of events.

Value Unit mask description
0 Load (Prefetch, PrefetchT0/T1/T2)
1 Store (PrefetchW)
2 NTA (PrefetchNTA)

Event Select 0x04C DCACHE misses by locked instructions

Abbreviation: DC misses locked inst

The number of data cache misses incurred by locked instructions. (The total number of locked instructions may be obtained from event 24h.)

Such misses may be satisfied from the L2 or system memory, but there is no provision for distinguishing between the two. When used for event-based profiling, this event will tend to occur very close to the offending instructions. (See also event 24h.) This event is also included in the basic Dcache miss event (41h).

Value Unit mask description
1 Data cache misses by locked instructions

L2 cache and system interface events

Event Select 0x067 Data prefetcher

Abbreviation: Data prefetcher

These events reflect requests made by the data prefetcher. UNIT_MASK bit 1 counts total prefetch requests, while bit 0 counts requests where the target block is found in the L2 or data cache. The difference between the two represents actual data read (in units of 64-byte cache lines) from the system by the prefetcher. This is also included in the count of event 7Fh, UNIT_MASK bit 0 (combined with other L2 fill events).

Value Unit mask description
0 Cancelled prefetches
1 Prefetch attempts

Event Select 0x06C System read responses by coherency state

Abbreviation: Sys read resp

The number of responses from the system for cache refill requests. The UNIT_MASK may be used to select specific cache coherency states. Each increment represents one 64-byte cache line transferred from the system (DRAM or another cache, including another core on the same node) to the data cache, instruction cache or L2 cache (for data prefetcher and TLB table walks). Modified-state responses may be for Dcache store miss refills, PrefetchW software prefetches, hardware prefetches for a store-miss stream, or Change-to-Dirty requests that get a dirty (Owned) probe hit in another cache. Exclusive responses may be for any Icache refill, Dcache load miss refill, other software prefetches, hardware prefetches for a load-miss stream, or TLB table walks that miss in the L2 cache; Shared responses may be for any of those that hit a clean line in another cache.

Value Unit mask description
0 Exclusive
1 Modified
2 Shared

Event Select 0x06D Quadwords written to system

Abbreviation: Quad written to sys

The number of quadword (8-byte) data transfers from the processor to the system. These may be part of a 64-byte cache line writeback or a 64-byte dirty probe hit response, each of which would cause eight increments; or a partial or complete Write Combining buffer flush (Sized Write), which could cause from one to eight increments.

Value Unit mask description
0 Quadword write transfer
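
Because each increment is one 8-byte transfer, the count converts directly to bytes and, given a measurement interval, to write bandwidth. A minimal sketch with hypothetical function names; the elapsed time is assumed to be measured by the caller.

```python
QUADWORD_BYTES = 8


def quadword_bytes(quadword_count: int) -> int:
    """Bytes written to the system for a given event 0x06D count."""
    return quadword_count * QUADWORD_BYTES


def write_bandwidth(quadword_count: int, elapsed_seconds: float) -> float:
    """Average bytes per second written to the system over the interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return quadword_bytes(quadword_count) / elapsed_seconds


# One full cache line writeback causes eight increments.
print(quadword_bytes(8))  # 64
```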

Event Select 0x07D Requests to L2 cache

Abbreviation: L2 requests

The number of requests to the L2 cache for Icache or Dcache fills, or page table lookups for the TLB. These events reflect only read requests to the L2; writes to the L2 are indicated by event 7Fh. These include some amount of retries associated with address or resource conflicts. Such retries tend to occur more as the L2 gets busier, and in certain extreme cases (such as large block moves that overflow the L2) these extra requests can dominate the event count.

These extra requests are not a direct indication of performance impact - they simply reflect opportunistic accesses that don't complete. But because of this, they are not a good indication of actual cache line movement. The Icache and Dcache miss and refill events (81h, 82h, 83h, 41h, 42h, 43h) provide a more accurate indication of this, and are the preferred way to measure such traffic.

Value Unit mask description
0 IC fill
1 DC fill
2 TLB fill (page table walks)
3 Tag snoop request
4 Cancelled request

Event Select 0x07E L2 cache misses

Abbreviation: L2 misses

The number of requests that miss in the L2 cache. This may include some amount of speculative activity, as well as some amount of retried requests as described in event 7Dh. The IC-fill-miss and DC-fill-miss events tend to mirror the Icache and Dcache refill-from-system events (83h and 43h, respectively), and tend to include more speculative activity than those events.

Value Unit mask description
0 IC fill
1 DC fill (includes possible replays)
2 TLB page table walk

Event Select 0x07F L2 fill/writeback

Abbreviation: L2 fill/write

The number of lines written into the L2 cache due to victim writebacks from the Icache or Dcache, TLB page table walks and the hardware data prefetcher (UNIT_MASK bit 0); or writebacks of dirty lines from the L2 to the system (UNIT_MASK bit 1). Each increment represents a 64-byte cache line transfer.

Note: Victim writebacks from the Dcache may be measured separately using event 44h. However this is not quite the same as the Dcache component of event 7Fh, the main difference being PrefetchNTA lines. When these are evicted from the Dcache due to replacement, they are written out to system memory (if dirty) or simply invalidated (if clean), rather than being moved to the L2 cache.

Value Unit mask description
0 L2 fills

Instruction cache events

Event Select 0x080 Instruction cache fetches

Abbreviation: IC fetches

The number of instruction cache accesses by the instruction fetcher. Each access is an aligned 16 byte read, from which a varying number of instructions may be decoded.

Event Select 0x081 Instruction cache misses

Abbreviation: IC misses

The number of instruction fetches that miss in the instruction cache. This is typically equal to or very close to the sum of events 82h and 83h. Each miss results in a 64-byte cache line refill.

Event Select 0x082 Instruction cache refills from L2

Abbreviation: IC refills from L2

The number of instruction cache refills satisfied from the L2 cache. Each increment represents one 64-byte cache line transfer.

Event Select 0x083 Instruction cache refills from system

Abbreviation: IC refills from sys

The number of instruction cache refills from system memory (or another cache). Each increment represents one 64-byte cache line transfer.

Event Select 0x084 L1 ITLB miss and L2 ITLB hit

Abbreviation: ITLB L1M L2H

The number of instruction fetches that miss in the L1 ITLB but hit in the L2 ITLB.

Event Select 0x085 L1 ITLB miss and L2 ITLB miss

Abbreviation: ITLB L1M L2M

The number of instruction fetches that miss in both the L1 and L2 ITLBs.

Event Select 0x086 Pipeline restart due to instruction stream probe

Abbreviation: Restart i-stream probe

The number of pipeline restarts caused by invalidating probes that hit on the instruction stream currently being executed. This would happen if the active instruction stream was being modified by another processor in an MP system - typically a highly unlikely event.

Event Select 0x087 Instruction fetch stall

Abbreviation: Inst fetch stall

The number of cycles the instruction fetcher is stalled. This may be for a variety of reasons such as branch predictor updates, unconditional branch bubbles, far jumps and cache misses, among others. May be overlapped by instruction dispatch stalls or instruction execution, such that these stalls don't necessarily impact performance.

Event Select 0x088 Return stack hits

Abbreviation: RET stack hits

The number of near return instructions (RET or RET Iw) that get their return address from the return address stack (i.e., where the stack has not gone empty). This may include cases where the address is incorrect (return mispredicts). This may also include speculatively executed false-path returns. Return mispredicts are typically caused by the return address stack underflowing; however, they may also be caused by an imbalance in calls vs. returns, such as doing a call but then popping the return address off the stack.

Note: This event cannot be reliably compared with events C9h and CAh (such as to calculate percentage of return mispredicts due to an empty return address stack), since it may include speculatively executed false-path returns that are not included in those retire-time events.

Event Select 0x089 Return stack overflows

Abbreviation: RET stack overflows

The number of (near) call instructions that cause the return address stack to overflow. When this happens, the oldest entry is discarded. This count may include speculatively executed calls.

Execution unit events

Event Select 0x026 Retired CLFLUSH Instructions

Abbreviation: Ret CLFLUSH inst

The number of CLFLUSH instructions retired.

Event Select 0x027 Retired CPUID Instructions

Abbreviation: Ret CPUID inst

The number of CPUID instructions retired.

Event Select 0x076 CPU clocks not halted (cycles)

Abbreviation: CPU clocks

The number of clocks that the CPU is not in a halted state (due to STPCLK or a HALT instruction). Note: this event allows system idle time to be automatically factored out from IPC (or CPI) measurements, providing the OS halts the CPU when going idle. If the OS goes into an idle loop rather than halting, such calculations will be influenced by the IPC of the idle loop.
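
The IPC and CPI figures mentioned above are simple ratios of the retired instruction count (event 0x0C0) to this event. A minimal sketch with hypothetical function names, assuming both counts were collected over the same interval:

```python
def ipc(retired_instructions: int, cpu_clocks: int) -> float:
    """Instructions per clock: event 0x0C0 count over event 0x076 count."""
    if cpu_clocks <= 0:
        raise ValueError("cpu_clocks must be positive")
    return retired_instructions / cpu_clocks


def cpi(retired_instructions: int, cpu_clocks: int) -> float:
    """Clocks per instruction, the reciprocal of IPC."""
    if retired_instructions <= 0:
        raise ValueError("retired_instructions must be positive")
    return cpu_clocks / retired_instructions


print(ipc(3_000_000, 2_000_000))  # 1.5
```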

Event Select 0x0C0 Retired Instructions

Abbreviation: Ret inst

The number of instructions retired (execution completed and architectural state updated). This count includes exceptions and interrupts - each exception or interrupt is counted as one instruction.

Event Select 0x0C1 Retired uops

Abbreviation: Ret uops

The number of micro-ops retired. This includes all processor activity (instructions, exceptions, interrupts, microcode assists, etc.).

Event Select 0x0C2 Retired branch instructions

Abbreviation: Ret branch

The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts.

Event Select 0x0C3 Retired mispredicted branch instructions

Abbreviation: Ret misp branch

The number of branch instructions retired, of any type, that were not correctly predicted. This includes those for which prediction is not attempted (far control transfers, exceptions and interrupts).

Event Select 0x0C4 Retired taken branch instructions

Abbreviation: Ret taken branch

The number of taken branches that were retired. This includes all types of architectural control flow changes, including exceptions and interrupts.

Event Select 0x0C5 Retired taken branch instructions mispredicted

Abbreviation: Ret taken branch misp

The number of retired taken branch instructions that were mispredicted.

Event Select 0x0C6 Retired far control transfers

Abbreviation: Ret far xfers

The number of far control transfers retired including far call/jump/return, IRET, SYSCALL and SYSRET, plus exceptions and interrupts. Far control transfers are not subject to branch prediction.

Event Select 0x0C7 Retired branch resyncs

Abbreviation: Ret branch resyncs

The number of resync branches. These reflect pipeline restarts due to certain microcode assists and events such as writes to the active instruction stream, among other things. Each occurrence reflects a restart penalty similar to a branch mispredict. Relatively rare.

Event Select 0x0C8 Retired near returns

Abbreviation: Ret near RET

The number of near return instructions (RET or RET Iw) retired.

Event Select 0x0C9 Retired near returns mispredicted

Abbreviation: Ret near RET misp

The number of near returns retired that were not correctly predicted by the return address predictor. Each such mispredict incurs the same penalty as a mispredicted conditional branch instruction.

Event Select 0x0CA Retired indirect branches mispredicted

Abbreviation: Ret ind branch misp

The number of indirect branch instructions retired where the target address was not correctly predicted.

Event Select 0x0CB Retired MMX/FP Instructions

Abbreviation: Ret MMX/FP inst

The number of MMX®, SSE or X87 instructions retired. The UNIT_MASK allows the selection of the individual classes of instructions as given in the table. Each increment represents one complete instruction.

Note: Since this event includes non-numeric instructions it is not suitable for measuring MFLOPS.

Value Unit mask description
0 x87 instructions
1 MMX and 3DNow! instructions
2 Packed SSE and SSE2 instructions
3 Scalar SSE and SSE2 instructions

Event Select 0x0CC Retired fastpath double op instructions

Abbreviation: Ret fastpath double op

Value Unit mask description
0 With low op in position 0
1 With low op in position 1
2 With low op in position 2

Event Select 0x0CD Interrupts-masked cycles

Abbreviation: Int-masked cycles

The number of processor cycles where interrupts are masked (EFLAGS.IF = 0). Using edge-counting with this event will give the number of times IF is cleared; dividing the cycle-count value by this value gives the average length of time that interrupts are disabled on each instance. Compare the edge count with event CFh to determine how often interrupts are disabled for interrupt handling vs. other reasons (e.g. critical sections).

Event Select 0x0CE Interrupts-masked cycles with interrupt pending

Abbreviation: Int-masked pending

The number of processor cycles where interrupts are masked (EFLAGS.IF = 0) and an interrupt is pending. Using edge-counting with this event and comparing the resulting count with the edge count for event CDh gives the proportion of interrupts for which handling is delayed due to prior interrupts being serviced, critical sections, etc. The cycle count value gives the total amount of time for such delays. The cycle count divided by the edge count gives the average length of each such delay.
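
The ratios described for events CDh and CEh can be collected in one place. The sketch below assumes four separately gathered counts (cycle count and edge count for each event); the function and key names are hypothetical.

```python
def interrupt_mask_stats(cd_cycles, cd_edges, ce_cycles, ce_edges):
    """Derive the interrupt-masking ratios described for events CDh and CEh.

    cd_cycles / cd_edges: cycle and edge counts for event 0x0CD.
    ce_cycles / ce_edges: cycle and edge counts for event 0x0CE.
    """
    def ratio(numerator, denominator):
        # Return 0.0 rather than failing when an event never occurred.
        return numerator / denominator if denominator else 0.0

    return {
        "avg_masked_cycles": ratio(cd_cycles, cd_edges),   # average masked period
        "delayed_fraction": ratio(ce_edges, cd_edges),     # CEh edges vs. CDh edges
        "avg_delay_cycles": ratio(ce_cycles, ce_edges),    # average pending delay
    }


stats = interrupt_mask_stats(cd_cycles=10_000, cd_edges=100,
                             ce_cycles=500, ce_edges=10)
print(stats["avg_masked_cycles"], stats["avg_delay_cycles"])  # 100.0 50.0
```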

Event Select 0x0CF Interrupts taken

Abbreviation: Int taken

The number of hardware interrupts taken. This does not include software interrupts (INT n instruction).

Event Select 0x0D0 Decoder empty

Abbreviation: Decoder empty

The number of processor cycles where the decoder has nothing to dispatch (typically waiting on an instruction fetch that missed the Icache, or for the target fetch after a branch mispredict).

Event Select 0x0D1 Dispatch stalls

Abbreviation: Dispatch stalls

The number of processor cycles where the decoder is stalled for any reason (has one or more instructions ready but can't dispatch them due to resource limitations in execution). This is the combined effect of events D2h - DAh, some of which may overlap; this event reflects the net stall cycles. The more common stall conditions (events D5h, D6h, D7h, D8h, and to a lesser extent D2h) may overlap considerably. The occurrence of these stalls is highly dependent on the nature of the code being executed (instruction mix, memory reference patterns, etc.).

Event Select 0x0D2 Dispatch stalls for branch abort to retire

Abbreviation: Stall branch abort

The number of processor cycles the decoder is stalled waiting for the pipe to drain after a mispredicted branch. This stall occurs if the corrected target instruction reaches the dispatch stage before the pipe has emptied. See also event D1h.

Event Select 0x0D3 Dispatch stall for serialization

Abbreviation: Stall serialization

The number of processor cycles the decoder is stalled due to a serializing operation, which waits for the execution pipeline to drain. Relatively rare; mainly associated with system instructions. See also event D1h.

Event Select 0x0D4 Dispatch stall for segment load

Abbreviation: Stall seg load

The number of processor cycles the decoder is stalled due to a segment load instruction being encountered while execution of a previous segment load operation is still pending. Relatively rare except in 16-bit code. See also event D1h.

Event Select 0x0D5 Dispatch stall for reorder buffer full

Abbreviation: Stall reorder full

The number of processor cycles the decoder is stalled because the reorder buffer is full. May occur simultaneously with certain other stall conditions; see event D1h.

Event Select 0x0D6 Dispatch stall for reservation station full

Abbreviation: Stall res station full

The number of processor cycles the decoder is stalled because a required integer unit reservation station is full. May occur simultaneously with certain other stall conditions; see event D1h.

Event Select 0x0D7 Dispatch stall for FPU full

Abbreviation: Stall FPU full

The number of processor cycles the decoder is stalled because the scheduler for the Floating Point Unit is full. This condition can be caused by a lack of parallelism in FP-intensive code, or by cache misses on FP operand loads (which could also show up as event D8h instead, depending on the nature of the instruction sequences). May occur simultaneously with certain other stall conditions; see event D1h.

Event Select 0x0D8 Dispatch stall for LS full

Abbreviation: Stall LS full

The number of processor cycles the decoder is stalled because the Load/Store Unit is full. This generally occurs due to heavy cache miss activity. May occur simultaneously with certain other stall conditions; see event D1h.

Event Select 0x0D9 Dispatch stall waiting for all quiet

Abbreviation: Stall waiting quiet

The number of processor cycles the decoder is stalled waiting for all outstanding requests to the system to be resolved. Relatively rare; associated with certain system instructions and types of interrupts. May partially overlap certain other stall conditions; see event D1h.

Event Select 0x0DA Dispatch stall for far control transfer or resync to retire

Abbreviation: Stall far/resync

The number of processor cycles the decoder is stalled waiting for the execution pipeline to drain before dispatching the target instructions of a far control transfer or a Resync (an instruction stream restart associated with certain microcode assists). Relatively rare; does not overlap with other stall conditions. See also event D1h.

Event Select 0x0DB FPU exceptions

Abbreviation: FPU except

The number of floating point unit exceptions for microcode assists. The UNIT_MASK may be used to isolate specific types of exceptions.

Value Unit mask description
0 x87 reclass microfaults
1 SSE retype microfaults
2 SSE reclass microfaults
3 SSE and x87 microtraps

Event Select 0x0DC DR0 Breakpoint Matches

Abbreviation: DR0 matches

The number of matches on the address in breakpoint register DR0, per the breakpoint type specified in DR7. The breakpoint does not have to be enabled. Each instruction breakpoint match incurs an overhead of about 120 cycles; load/store breakpoint matches do not incur any overhead.

Event Select 0x0DD DR1 Breakpoint Matches

Abbreviation: DR1 matches

The number of matches on the address in breakpoint register DR1. See notes for event DCh.

Event Select 0x0DE DR2 Breakpoint Matches

Abbreviation: DR2 matches

The number of matches on the address in breakpoint register DR2. See notes for event DCh.

Event Select 0x0DF DR3 Breakpoint Matches

Abbreviation: DR3 matches

The number of matches on the address in breakpoint register DR3. See notes for event DCh.

Memory controller events

Event Select 0x0E0 DRAM accesses

Abbreviation: DRAM accesses

The number of memory accesses performed by the local DRAM controller. The UNIT_MASK may be used to isolate the different DRAM page access cases. Page miss cases incur an extra latency to open a page; page conflict cases incur both page-close and page-open penalties. These penalties may be overlapped by DRAM accesses for other requests and don't necessarily represent lost DRAM bandwidth. The associated penalties are as follows:

Page miss: Trcd (DRAM RAS-to-CAS delay)

Page conflict: Trp + Trcd (DRAM row-precharge time plus RAS-to-CAS delay)

Each DRAM access represents one 64-byte block of data transferred if the DRAM is configured for 64-byte granularity, or one 32-byte block if the DRAM is configured for 32-byte granularity. (The latter is only applicable to single-channel DRAM systems, which may be configured either way.)

Value Unit mask description
0 Page hit
1 Page miss
2 Page conflict
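
The three page-access counts can be combined into a total data volume and a page hit rate. A minimal sketch with hypothetical names; the access granularity (64 or 32 bytes) is configuration-dependent and must be supplied by the caller.

```python
def dram_summary(page_hits, page_misses, page_conflicts, bytes_per_access=64):
    """Summarize event 0x0E0 counts gathered per Unit Mask value (0, 1, 2)."""
    if bytes_per_access not in (32, 64):
        raise ValueError("granularity is either 32 or 64 bytes")
    total = page_hits + page_misses + page_conflicts
    return {
        "accesses": total,
        "bytes": total * bytes_per_access,
        "page_hit_rate": page_hits / total if total else 0.0,
    }


summary = dram_summary(page_hits=600, page_misses=300, page_conflicts=100)
print(summary["accesses"], summary["bytes"])  # 1000 64000
```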

Event Select 0x0E1 Memory controller page table overflows

Abbreviation: Page table overflows

The number of page table overflows in the local DRAM controller. This table maintains information about which DRAM pages are open. An overflow occurs when a request for a new page arrives when the maximum number of pages are already open. Each occurrence reflects an access latency penalty equivalent to a page conflict.

Event Select 0x0E3 Memory controller turnarounds

Abbreviation: Turnarounds

The number of turnarounds on the local DRAM data bus. The UNIT_MASK may be used to isolate the different cases. These represent lost DRAM bandwidth, which may be calculated as follows (in bytes per occurrence):

DIMM turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * 2

R/W turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * 1

W/R turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * (Tcl-1)

where DRAM_width_in_bytes is 8 or 16 (for single- or dual-channel systems, respectively), and Tcl is the CAS latency of the DRAM in memory system clock cycles (where the memory clock for DDR-400, or PC3200 DIMMs, for example, would be 200 MHz).

Value Unit mask description
0 DIMM (chip select) turnaround
1 Read to write turnaround
2 Write to read turnaround
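
The lost-bandwidth formulas above are simple enough to express directly. This sketch assumes that the (Tcl-1) formula applies to the write-to-read case and the factor-of-one formula to the read-to-write case; the function and the string keys are hypothetical names chosen for illustration.

```python
EDGES_PER_MEMCLK = 2  # data is transferred on both edges of the memory clock


def lost_bytes_per_turnaround(kind, dram_width_bytes, tcl=None):
    """Bytes of DRAM bandwidth lost per occurrence of event E3h.

    kind: "dimm", "read_to_write" or "write_to_read" (illustrative labels).
    dram_width_bytes: 8 for single-channel, 16 for dual-channel systems.
    tcl: CAS latency in memory clocks; needed only for "write_to_read".
    """
    if dram_width_bytes not in (8, 16):
        raise ValueError("dram_width_bytes must be 8 or 16")
    if kind == "dimm":
        clocks = 2
    elif kind == "read_to_write":
        clocks = 1
    elif kind == "write_to_read":
        if tcl is None:
            raise ValueError("tcl is required for write_to_read")
        clocks = tcl - 1  # assumption: the (Tcl-1) case is write-to-read
    else:
        raise ValueError("unknown turnaround kind: %r" % (kind,))
    return dram_width_bytes * EDGES_PER_MEMCLK * clocks
```

Multiplying the result by the matching event count gives an estimate of total bytes lost to that turnaround type over the measurement interval.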

Event Select 0x0E4 Memory controller bypass counter saturation

Abbreviation: Bypass ctr sat

Value Unit mask description
0 Memory controller high priority bypass
1 Memory controller low priority bypass
2 DRAM controller interface bypass
3 DRAM controller queue bypass

Event Select 0x0E5 Sized blocks

Abbreviation: Sized blocks

These events provide measures of two unrelated sets of events: DRAM request cancellation activity, and sized read/write block sizes.

Cancel activity: The number of MemCancel requests received by the local memory controller due to dirty probe hits. When a probe for a cache refill or DMA read hits a dirty line, the cache will provide the data and inhibit the return of the stale cache line from memory, to conserve system bandwidth. These events reflect the total MemCancel requests seen by the memory controller (UNIT_MASK bit 0), and those requests which actually arrive in time to inhibit the stale data transfer (bit 1). These mask bits should be used separately; the combined count is not particularly meaningful.

Note: Successful cancels may or may not inhibit the DRAM access, depending on whether they arrive soon enough. Such accesses are reflected in event E0h, DRAM Accesses, but it is not possible to isolate this particular component (DRAM accesses that ideally would have been prevented). The upper bound would be the number of cancel requests seen; however, this is typically far too pessimistic to be a useful approximation.

Sized Read/Write activity: The Sized Read/Write events reflect 32- or 64-byte transfers (as opposed to other sizes, which could be anywhere between 1 and 64 bytes), from either the processor or the Hostbridge (on any node in an MP system). Such accesses from the processor would be due only to write combining buffer flushes, where 32-byte accesses would reflect flushes of partially filled buffers. Event 65h provides a count of sized write requests associated with WC buffer flushes; comparing that with counts for these events (provided there is very little Hostbridge activity at the same time) will give an indication of how efficiently the write combining buffers are being used. Event 65h may also be useful in factoring out WC flushes when comparing these events with the Upstream Requests component of event ECh.

Value Unit mask description
2 32-byte sized writes (Rev D and later)
3 64-byte sized writes (Rev D and later)
4 32-byte sized reads (Rev D and later)
5 64-byte sized reads (Rev D and later)

Event Select 0x0E8 Thermal Status and ECC Errors

Abbreviation: Thermal/ECC errors

Value Unit mask description
0 Clocks CPU is active when HTC is active
1 Clocks CPU clock is inactive when HTC is active
2 Clocks when die temperature is higher than software high temperature threshold
3 Clocks when high temperature threshold was exceeded
7 Correctable and uncorrectable DRAM ECC errors (Rev E)

Event Select 0x0E9 CPU/IO requests to memory/IO

Abbreviation: CPU/IO req mem/IO

These events reflect request flow between units and nodes, as selected by the UNIT_MASK. The UNIT_MASK is divided into two fields: request type (CPU or I/O access to I/O or Memory) and source/target location (local vs. remote). One or more request types must be enabled via bits 3:0, and at least one source and one target location must be selected via bits 7:4. Each event reflects a request of the selected type(s) going from the selected source(s) to the selected target(s).

Not all possible paths are supported. The following table shows the UNIT_MASK values that are valid for each request type. Any of the mask values shown may be logically ORed to combine the events. For instance, local CPU requests to both local and remote nodes would be A8h | 98h = B8h. Any CPU to any I/O would be A4h | 94h | 64h = F4h (but remote CPU to remote I/O requests would not be included).

Note: It is not possible to tell from these events how much data is going in which direction, as there is no distinction between reads and writes. Also, particularly for I/O, the requests may be for varying amounts of data, anywhere from one to sixty-four bytes. Event E5h provides an indication of 32- and 64-byte read and write transfers for such requests (although from the target point of view). For a direct measure of the amount and direction of data flowing between nodes, use events F6h, F7h and F8h.

Value Unit mask description
0 I/O to I/O
1 I/O to memory
2 CPU to I/O
3 CPU to memory
4 To remote node
5 To local node
6 From remote node
7 From local node
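
The example masks quoted above (A8h, 98h, B8h, F4h) follow from the mask value table: each mask value is a bit index, and the UNIT_MASK is the OR of the selected bits. The sketch below shows that arithmetic; the helper and constant names are hypothetical, and the code does not check whether a given path is actually supported by the hardware.

```python
def unit_mask(*mask_values):
    """Build an 8-bit UNIT_MASK by ORing bit (1 << value) for each mask value."""
    mask = 0
    for value in mask_values:
        if not 0 <= value <= 7:
            raise ValueError("mask value must be in 0..7, got %r" % (value,))
        mask |= 1 << value
    return mask


# Event E9h mask values, named here for readability only.
IO_TO_IO, IO_TO_MEM, CPU_TO_IO, CPU_TO_MEM = 0, 1, 2, 3
TO_REMOTE, TO_LOCAL, FROM_REMOTE, FROM_LOCAL = 4, 5, 6, 7

# Local CPU requests to local memory and to remote memory.
local_cpu_to_local_mem = unit_mask(CPU_TO_MEM, TO_LOCAL, FROM_LOCAL)
local_cpu_to_remote_mem = unit_mask(CPU_TO_MEM, TO_REMOTE, FROM_LOCAL)
local_cpu_to_any_mem = local_cpu_to_local_mem | local_cpu_to_remote_mem

# Any CPU to any I/O, as the OR of the three component masks.
any_cpu_to_any_io = (unit_mask(CPU_TO_IO, TO_LOCAL, FROM_LOCAL)
                     | unit_mask(CPU_TO_IO, TO_REMOTE, FROM_LOCAL)
                     | unit_mask(CPU_TO_IO, TO_LOCAL, FROM_REMOTE))
```

The same helper applies to any event whose unit mask bits may be combined, since the mask value to bit mapping is the same throughout this section.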

Event Select 0x0EA Cache block commands

Abbreviation: Cache block cmd

The number of requests made to the system for cache line transfers or coherency state changes, by request type. Each increment represents one cache line transfer, except for Change-to-Dirty. If a Change-to-Dirty request hits on a line in another processor's cache that is in the Owned state, it will cause a cache line transfer; otherwise there is no data transfer associated with Change-to-Dirty requests.

Value Unit mask description
0 Victim block (writeback)
1 Read block (DCache load miss refill)
2 Read block Shared (ICache refill)
3 Read block Modified (DCache store miss refill)
4 Change to Dirty

Event Select 0x0EB Sized commands

Abbreviation: Sized cmd

The number of Sized Read/Write commands handled by the System Request Interface (local processor and hostbridge interface to the system). These commands may originate from the processor or hostbridge. Typical uses of the various Sized Read/Write commands are given in the UNIT_MASK table. See also event E5h, which covers commonly-used block sizes for these requests, and event ECh, which provides a separate measure of Hostbridge accesses.

Value Unit mask description
0 Non-posted SzWr byte (1-32 bytes)
1 Non-posted SzWr DWORD (1-16 DWORDs)
2 Posted SzWr byte (1-32 bytes)
3 Posted SzWr DWORD (1-16 DWORDs)
4 SzRd byte (4 bytes)
5 SzRd DWORD (1-16 DWORDs)
6 RdModWr

Event Select 0x0EC Probe responses and upstream requests

Abbreviation: Probe resp/up req

This covers two unrelated sets of events: cache probe results, and requests received by the Hostbridge from devices on non-coherent links.

Probe results: These events reflect the results of probes sent from a memory controller to local caches. They provide an indication of the degree to which data and code are shared between processors (or moved between processors due to process migration). The dirty-hit events indicate the transfer of a 64-byte cache line to the requestor (for a read or cache refill) or the target memory (for a write). The system bandwidth used by these, in terms of bytes per unit of time, may be calculated as 64 times the event count, divided by the elapsed time. Sized writes to memory that cover a full cache line do not incur this cache line transfer; they simply invalidate the line and are reported as clean hits. Cache line transfers will occur for Change2Dirty requests that hit cache lines in the Owned state. (Such cache lines are counted as Modified-state refills for event 6Ch, System Read Responses.)

Upstream requests: The upstream read and write events reflect requests originating from a device on a local non-coherent HyperTransport link. The two read events allow display refresh traffic in a UMA system to be measured separately from other DMA activity. Display refresh traffic will typically be dominated by 64-byte transfers. Non-display-related DMA accesses may be anywhere from 1 to 64 bytes in size, but may be dominated by a particular size such as 32 or 64 bytes, depending on the nature of the devices. Event E5h can provide a measure of 32- and 64-byte accesses by the hostbridge (possibly combined with write combining buffer flush activity from the processor, although that can be factored out via event 65h).

Value Unit mask description
0 Probe miss
1 Probe hit clean
2 Probe hit dirty without memory cancel
3 Probe hit dirty with memory cancel
4 Upstream display refresh reads
5 Upstream non-display refresh reads
6 Upstream writes (Rev D and later)
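
The dirty-hit bandwidth rule stated above (64 times the event count, divided by elapsed time) can be sketched as follows. The function name is illustrative, and the count is assumed to be the sum of the two dirty-hit subevents (mask values 2 and 3) over the measured interval.

```python
CACHE_LINE_BYTES = 64  # each dirty probe hit transfers one 64-byte cache line


def dirty_hit_bandwidth(dirty_hit_count, elapsed_seconds):
    """Bytes per second of system bandwidth used by dirty probe hits."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return CACHE_LINE_BYTES * dirty_hit_count / elapsed_seconds
```

For example, one million dirty hits observed over two seconds would correspond to 32,000,000 bytes per second.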

Event Select 0x0EE GART events

Abbreviation: GART events

These events reflect GART activity, and in particular allow one to calculate the GART TLB miss ratio as GART_miss_count divided by GART_aperture_hit_count. GART aperture accesses are typically from I/O devices as opposed to the processor, and generally from a 3D graphics accelerator, but can be from other devices when the GART is used as an IO MMU.

Value Unit mask description
0 GART aperture hit on access from CPU
1 GART aperture hit on access from I/O
2 GART miss
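
The GART TLB miss ratio described above can be computed as in the following sketch. It assumes that GART_aperture_hit_count is the sum of the CPU and I/O aperture hit subevents (mask values 0 and 1); that interpretation, and the function name, are assumptions rather than something the event description states.

```python
def gart_tlb_miss_ratio(miss_count, cpu_hit_count, io_hit_count):
    """GART TLB miss ratio: misses divided by aperture hits.

    Assumption: the aperture hit count is CPU hits plus I/O hits.
    """
    aperture_hits = cpu_hit_count + io_hit_count
    if aperture_hits == 0:
        raise ValueError("no GART aperture hits were counted")
    return miss_count / aperture_hits
```

With 50 misses against 200 CPU hits and 800 I/O hits, the ratio comes out at 0.05.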

Link events

Event Select 0x0F6 HyperTransport™ link 0 transmit bandwidth

Abbreviation: HT0 bandwidth

The number of dwords transmitted (or unused, in the case of Nops) on the outgoing side of HyperTransport link 0. The sum of all four subevents (all four UNIT_MASK bits set) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Data and Buffer Release count (UNIT_MASK 07h) by the sum of that count and the Nop count (UNIT_MASK 08h). Bandwidth in bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by the elapsed time.

The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e. which links connect to which nodes).

Value Unit mask description
0 Command DWORD sent
1 Data DWORD sent
2 Buffer release DWORD sent
3 NOP DWORD sent (idle)

Event Select 0x0F7 HyperTransport™ link 1 transmit bandwidth

Abbreviation: HT1 bandwidth

The number of dwords transmitted (or unused, in the case of Nops) on the outgoing side of HyperTransport link 1. The sum of all four subevents (all four UNIT_MASK bits set) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Data and Buffer Release count (UNIT_MASK 07h) by the sum of that count and the Nop count (UNIT_MASK 08h). Bandwidth in bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by the elapsed time.

The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e. which links connect to which nodes).

Value Unit mask description
0 Command DWORD sent
1 Data DWORD sent
2 Buffer release DWORD sent
3 NOP DWORD sent (idle)

Event Select 0x0F8 HyperTransport™ link 2 transmit bandwidth

Abbreviation: HT2 bandwidth

The number of dwords transmitted (or unused, in the case of Nops) on the outgoing side of HyperTransport link 2. The sum of all four subevents (all four UNIT_MASK bits set) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Data and Buffer Release count (UNIT_MASK 07h) by the sum of that count and the Nop count (UNIT_MASK 08h). Bandwidth in bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by the elapsed time.

The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e. which links connect to which nodes).

Value Unit mask description
0 Command DWORD sent
1 Data DWORD sent
2 Buffer release DWORD sent
3 NOP DWORD sent (idle)
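
The utilization and bandwidth arithmetic described for events F6h through F8h is the same for each link and can be sketched as below. The function names are illustrative; the four counts are assumed to come from the same link over the same interval.

```python
DWORD_BYTES = 4  # each counted dword is four bytes


def link_utilization(command, data, buffer_release, nop):
    """Fraction of transmitted dwords that were not idle Nops."""
    busy = command + data + buffer_release   # UNIT_MASK 07h
    total = busy + nop                       # plus UNIT_MASK 08h
    if total == 0:
        raise ValueError("no dwords were counted on this link")
    return busy / total


def link_bandwidth(dword_count, elapsed_seconds):
    """Bytes per second for any one subevent count or combination of counts."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return dword_count * DWORD_BYTES / elapsed_seconds
```

Passing only the Data count to link_bandwidth gives the payload rate on the link, while passing the sum of all four counts gives the link's maximum transmission rate.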