This section describes processor performance monitor events available for performance analysis and tuning for AMD Family 10h processors. The AMD Family 10h processors provide four 48-bit performance counters per available core, which allows four types of events to be monitored simultaneously. The performance counters are not guaranteed to be fully accurate and should be used as a relative measure of performance to assist in application tuning. Unlisted event numbers are reserved and their results are undefined.
The Event Select value selects the event to be monitored. The Unit Mask further qualifies the event selected by the Event Select value. The Mask Value given here is an index that corresponds to the actual 8-bit Unit Mask, as specified in the following table.
Mask value | Unit Mask |
0 | 0x01 |
1 | 0x02 |
2 | 0x04 |
3 | 0x08 |
4 | 0x10 |
5 | 0x20 |
6 | 0x40 |
7 | 0x80 |
Unless otherwise stated, the Unit Mask values shown may be combined to select any desired combination of the sub-events for a given event. For events where no Unit Mask table is shown, the Unit Mask is not applicable.
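As an illustration of this encoding, the following C sketch (illustrative only, not part of the specification) builds the actual 8-bit Unit Mask field from a set of Mask Value indices:

```c
#include <stdint.h>

/* Build the 8-bit Unit Mask field from the "Mask value" indices in the
 * table above: each index selects one bit of the actual Unit Mask
 * (index 0 -> 0x01, index 3 -> 0x08, and so on). Indices may be
 * combined, per the rule stated above. */
static uint8_t unit_mask_from_indices(const unsigned *indices, unsigned n)
{
    uint8_t mask = 0;
    for (unsigned i = 0; i < n; i++)
        mask |= (uint8_t)(1u << indices[i]);
    return mask;
}

/* Example: mask values 0 and 2 combine to a Unit Mask of 0x05. */
```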
Speculative vs. Retired events: Several events may include speculative activity, meaning they may count operations from false-path instructions that are ultimately discarded due to a branch misprediction. Events associated with Retire reflect actual program execution. Events where the distinction matters are explicitly labeled as one or the other.
For detailed information, refer to the BIOS and Kernel Developer's Guide for AMD Family 10h Processors.
Abbreviation: FPU ops
The number of operations (uops) dispatched to the FPU execution pipelines. This event reflects how busy the FPU pipelines are. This includes all operations done by x87, MMX® and SSE instructions, including moves. Each increment represents a one-cycle dispatch event; packed 128-bit SSE operations count as two ops; scalar operations count as one. Speculative. (See also event CBh).
Note: Since this event includes non-numeric operations, it is not suitable for measuring MFLOPS.
Value | Unit mask description |
0 | Add pipe ops excluding junk ops |
1 | Multiply pipe ops excluding junk ops |
2 | Store pipe ops excluding junk ops |
3 | Add pipe load ops |
4 | Multiply pipe load ops |
5 | Store pipe load ops |
Abbreviation: No FPU op cycles
The number of cycles in which no FPU operations were retired. Invert this (set the Invert control bit in the event select MSR) to count cycles in which at least one FPU operation was retired.
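A minimal sketch of this inverted-count setup, assuming the PerfEvtSel bit layout documented in the BKDG (event select in bits 7:0 and 35:32, unit mask in bits 15:8, user/OS enables in bits 16 and 17, counter enable in bit 22, Invert in bit 23); the event number would come from the event tables, not from this code:

```c
#include <stdint.h>

/* Encode a performance event select value, optionally with the Invert
 * control bit set so the counter counts cycles in which the selected
 * condition is NOT met (e.g. cycles with at least one FPU op retired
 * instead of cycles with none). Field positions assume the BKDG
 * PerfEvtSel layout. */
static uint64_t perfevtsel(uint16_t event, uint8_t unit_mask, int invert)
{
    uint64_t v = 0;
    v |= (uint64_t)(event & 0xFF);             /* EventSelect[7:0]   */
    v |= (uint64_t)unit_mask << 8;             /* UnitMask           */
    v |= 1ULL << 16;                           /* count user mode    */
    v |= 1ULL << 17;                           /* count kernel mode  */
    v |= 1ULL << 22;                           /* counter enable     */
    if (invert)
        v |= 1ULL << 23;                       /* Invert control bit */
    v |= (uint64_t)((event >> 8) & 0xF) << 32; /* EventSelect[11:8]  */
    return v;
}
```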
Abbreviation: Fast flag FPU ops
The number of FPU operations that use the fast flag interface (e.g. FCOMI, COMISS, COMISD, UCOMISS, UCOMISD). Speculative.
Abbreviation: Retired SSE Ops
The number of SSE operations retired. This counter can count either FLOPS (UnitMask bit 6 = 1) or uops (UnitMask bit 6 = 0).
Value | Unit mask description |
0 | Single precision add/subtract ops |
1 | Single precision multiply ops |
2 | Single precision divide/square root ops |
3 | Double precision add/subtract ops |
4 | Double precision multiply ops |
5 | Double precision divide/square root ops |
6 | Op type: 0 = uops, 1 = FLOPS |
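A minimal sketch of the MFLOPS calculation this event enables, assuming a unit mask of 7Fh (op-type bit 6 set to count FLOPS, bits 5:0 enabling all six op classes) and that the caller supplies the counter delta and elapsed time:

```c
#include <stdint.h>

/* Convert a FLOPS count from the retired SSE ops event (unit mask
 * 0x7F: bit 6 selects FLOPS counting, bits 5:0 enable all op classes)
 * into MFLOPS over a measured interval. */
static double mflops(uint64_t flops, double elapsed_seconds)
{
    return (double)flops / elapsed_seconds / 1.0e6;
}
```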
Abbreviation: Ret move ops
The number of move uops retired. Merging low quadword move ops copy the lower 64 bits of a source register to the upper 64 bits of a destination register. The lower 64 bits of the destination register remain unchanged. Merging high quadword move ops copy the upper 64 bits of a source register to the lower 64 bits of a destination register. The upper 64 bits of the destination register remain unchanged.
Value | Unit mask description |
0 | Merging low quadword move uops |
1 | Merging high quadword move uops |
2 | All other merging move uops |
3 | All other move uops |
Abbreviation: Ret serializing ops
The number of serializing uops retired. A bottom-executing uop is not issued until it is the oldest non-retired uop in the FPU. Bottom-executing ops are most commonly seen with the FSTSW and STMXCSR instructions. A bottom-serializing uop does not issue until it is the oldest non-issued uop in the FP scheduler; it blocks all subsequent uops from issuing until it has issued itself. Bottom-serializing ops are most commonly seen with the FLDCW and LDMXCSR instructions.
Value | Unit mask description |
0 | SSE bottom-executing uops retired |
1 | SSE bottom-serializing uops retired |
2 | x87 bottom-executing uops retired |
3 | x87 bottom-serializing uops retired |
Abbreviation: Serialized uop cycles
See EventSelect 005h for a description of bottom-executing and bottom-serializing uops.
Value | Unit mask description |
0 | Number of cycles a bottom-executing uop is in the FP scheduler |
1 | Number of cycles a bottom-serializing uop is in the FP scheduler |
Abbreviation: Seg reg loads
The number of segment register loads performed.
Value | Unit mask description |
0 | ES |
1 | CS |
2 | SS |
3 | DS |
4 | FS |
5 | GS |
6 | HS |
Abbreviation: Restart self-mod code
The number of pipeline restarts that were caused by self-modifying code (a store that hits any instruction that's been fetched for execution beyond the instruction doing the store).
Abbreviation: Restart probe hit
The number of pipeline restarts caused by an invalidating probe hitting on a speculative out-of-order load.
Abbreviation: LS buffer2 full
The number of cycles that the LS2 buffer is full. This buffer holds stores waiting to retire as well as requests that missed the data cache and are waiting on a refill. This condition will stall further data cache accesses, although such stalls may be overlapped by independent instruction execution.
Abbreviation: Locked ops
This event covers locked operations performed and their execution time. The execution time represented by the cycle counts is typically overlapped to a large extent with other instructions. The non-speculative cycles event is suitable for event-based profiling of lock operations that tend to miss in the cache.
Value | Unit mask description |
0 | Number of locked instructions executed |
1 | Number of cycles spent in speculative phase |
2 | Number of cycles spent in non-speculative phase |
Abbreviation: Ret CLFLUSH inst
The number of CLFLUSH instructions retired.
Abbreviation: Ret CPUID inst
The number of CPUID instructions retired.
Abbreviation: Cancelled fwd ops
Revision B: Counts the number of store-to-load forwarding operations that are cancelled.
Value | Unit mask description |
0 | Address mismatches |
1 | Store is smaller than load |
2 | Misaligned |
Abbreviation: SMIs received
Revision B: Counts the number of SMIs received by the processor.
Abbreviation: DC accesses
The number of accesses to the data cache for load and store references. This may include certain microcode scratchpad accesses, although these are generally rare. Each increment represents an eight-byte access, although the instruction may only be accessing a portion of that. Speculative.
Abbreviation: DC misses
The number of data cache references which missed in the data cache. Speculative.
Except in the case of streaming stores, only the first miss for a given line is included - access attempts by other instructions while the refill is still pending are not included in this event. So in the absence of streaming stores, each event reflects one 64-byte cache line refill, and counts of this event are the same as, or very close to, the combined count for event 42h.
Streaming stores, however, will cause this event for every such store, since the target memory is not refilled into the cache. Hence this event should not be used as an indication of data cache refill activity - event 42h should be used for such measurements. (See event 65h for an indication of streaming store activity.) A large difference between events 41h (with all UNIT_MASK bits set) and 42h would be due mainly to streaming store activity.
Abbreviation: DC refills L2/NB
The number of data cache refills satisfied from the L2 cache (and/or the Northbridge), per the UNIT_MASK. UNIT_MASK bits 4:1 allow a breakdown of refills from the L2 by coherency state. UNIT_MASK bit 0 reflects refills which missed in the L2, and provides the same measure as the combined sub-events of event 43h. Each increment reflects a 64-byte transfer. Speculative.
Value | Unit mask description |
0 | Refill from Northbridge |
1 | Shared-state line from L2 |
2 | Exclusive-state line from L2 |
3 | Owned-state line from L2 |
4 | Modified-state line from L2 |
Abbreviation: DC refills NB
The number of L1 cache refills satisfied from the Northbridge (DRAM, L3 cache, or another processor's cache), as opposed to the L2. The UNIT_MASK selects lines in one or more specific coherency states. Each increment reflects a 64-byte transfer. Speculative.
Value | Unit mask description |
0 | Invalid |
1 | Shared |
2 | Exclusive |
3 | Owned |
4 | Modified |
Abbreviation: DC evicted
The number of L1 data cache lines written to the L2 cache or system memory, having been displaced by L1 refills. The UNIT_MASK may be used to count only victims in specific coherency states. Each increment represents a 64-byte transfer. Speculative.
In most cases, L1 victims are moved to the L2 cache, displacing an older cache line there. Lines brought into the data cache by PrefetchNTA instructions, however, are evicted directly to system memory (if dirty) or invalidated (if clean). The Invalid case (UnitMask value 01h) reflects the replacement of lines that would have been invalidated by probes for write operations from another processor or DMA activity. UnitMask 20h and 40h count all evictions regardless of cache line state. When either UnitMask 20h or 40h is enabled, all other UnitMasks should be disabled.
Value | Unit mask description |
0 | Invalid |
1 | Shared |
2 | Exclusive |
3 | Owned |
4 | Modified |
Abbreviation: DTLB L1M L2H
The number of data cache accesses that miss in the L1 DTLB and hit in the L2 DTLB. Speculative.
Value | Unit mask description |
0 | L2 4K TLB hit |
1 | L2 2M TLB hit |
Abbreviation: DTLB L1M L2M
The number of data cache accesses that miss in both the L1 and L2 DTLBs. Speculative.
Value | Unit mask description |
0 | 4K TLB reload |
1 | 2M TLB reload |
2 | 1G TLB reload |
Abbreviation: Misalign access
The number of data cache accesses that are misaligned. These are accesses which cross an eight-byte boundary. They incur an extra cache access (reflected in event 40h), and an extra cycle of latency on reads. Speculative.
Abbreviation: Late cancel
Abbreviation: Early cancel
Abbreviation: 1-bit ECC errors
The number of single-bit errors corrected by either of the error detection/correction mechanisms in the data cache.
Value | Unit mask description |
0 | Scrubber error |
1 | Piggyback scrubber errors |
2 | Load pipe error |
3 | Store write pipe error |
Abbreviation: Prefetch inst
The number of prefetch instructions dispatched by the decoder. Such instructions may or may not cause a cache line transfer. All Dcache and L2 accesses by prefetch instructions (hits and misses) are included in these events, except for prefetch instructions that collide with an outstanding hardware prefetch. Speculative.
Value | Unit mask description |
0 | Load (Prefetch, PrefetchT0/T1/T2) |
1 | Store (PrefetchW) |
2 | NTA (PrefetchNTA) |
Abbreviation: DC misses locked inst
The number of data cache misses incurred by locked instructions. (The total number of locked instructions may be obtained from event 24h.)
Such misses may be satisfied from the L2 or system memory, but there is no provision for distinguishing between the two. When used for event-based profiling, this event will tend to occur very close to the offending instructions. This event is also included in the basic Dcache miss event (41h).
Value | Unit mask description |
1 | Data cache misses by locked instructions |
Abbreviation: L1 DTLB hit
The number of data cache accesses that hit in the L1 DTLB. Speculative.
Value | Unit mask description |
0 | L1 4K TLB hit |
1 | L1 2M TLB hit |
2 | L1 1G TLB hit |
Abbreviation: Ineffective SW prefetch
The number of software prefetches that did not fetch data outside of the processor core.
Value | Unit mask description |
0 | Software prefetch hit in the L1 |
3 | Software prefetch hit in the L2 |
Abbreviation: Global TLB flushes
This event counts TLB flushes that flush TLB entries that have the global bit set.
Abbreviation: Memory req
These events reflect accesses to uncacheable (UC) or write-combining (WC) memory regions (as defined by MTRR or PAT settings) and streaming store activity to WB memory. Both the WC and Streaming Store events reflect Write Combining buffer flushes, not individual store instructions. A WC buffer flush typically consists of one 64-byte write to the system (assuming software fills a buffer before it is flushed); a partially-filled buffer requires two or more smaller writes to the system. The WC event reflects flushes of WC buffers that were filled by stores to WC memory or streaming stores to WB memory. The Streaming Store event reflects only flushes due to streaming stores (which are typically only to WB memory). The difference between the counts of these two events reflects the actual amount of write traffic to WC memory.
Value | Unit mask description |
0 | Requests to non-cacheable (UC) memory |
1 | Requests to write-combining (WC) memory |
7 | Streaming store (SS) requests |
Abbreviation: Data prefetcher
These events reflect requests made by the data prefetcher. UNIT_MASK bit 1 counts total prefetch requests, while bit 0 counts requests where the target block is found in the L2 or data cache. The difference between the two represents actual data read (in units of 64-byte cache lines) from the system by the prefetcher. This is also included in the count of event 7Fh, UNIT_MASK bit 0 (combined with other L2 fill events).
Value | Unit mask description |
0 | Cancelled prefetches |
1 | Prefetch attempts |
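A sketch of the data-read calculation described above (the difference between prefetch attempts and cancelled prefetches, in 64-byte lines); both counts are assumed to come from this event programmed with the corresponding unit mask bits:

```c
#include <stdint.h>

/* Bytes actually read from the system by the hardware data prefetcher:
 * total attempts minus cancelled prefetches, times the 64-byte cache
 * line size. */
static uint64_t prefetched_bytes(uint64_t attempts, uint64_t cancelled)
{
    return (attempts - cancelled) * 64u;
}
```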
Abbreviation: NB read resp coh state
The number of responses from the Northbridge for cache refill requests. The UnitMask may be used to select specific cache coherency states. Each increment represents one 64-byte cache line transferred from the Northbridge (DRAM, L3, or another cache, including another core on the same node) to the data cache, instruction cache or L2 cache (for data prefetcher and TLB table walks). Modified-state responses may be for Dcache store miss refills, PrefetchW software prefetches, hardware prefetches for a store-miss stream, or Change-to-Dirty requests that get a dirty (Owned) probe hit in another cache. Exclusive responses may be for any Icache refill, Dcache load miss refill, other software prefetches, hardware prefetches for a load-miss stream, or TLB table walks that miss in the L2 cache. Shared responses may be for any of those that hit a clean line in another cache.
Value | Unit mask description |
0 | Exclusive |
1 | Modified |
2 | Shared |
4 | Data error |
Abbreviation: Octwords written to sys
The number of octword (16-byte) data transfers from the processor to the system. These may be part of a 64-byte cache line writeback or a 64-byte dirty probe hit response, each of which would cause four increments; or a partial or complete Write Combining buffer flush (Sized Write), which could cause from one to eight increments.
Value | Unit mask description |
0 | Octword write transfer |
Abbreviation: L2 requests
The number of requests to the L2 cache for Icache or Dcache fills, or page table lookups for the TLB. These events reflect only read requests to the L2; writes to the L2 are indicated by event 7Fh. These include some amount of retries associated with address or resource conflicts. Such retries tend to occur more as the L2 gets busier, and in certain extreme cases (such as large block moves that overflow the L2) these extra requests can dominate the event count.
These extra requests are not a direct indication of performance impact - they simply reflect opportunistic accesses that don't complete. But because of this, they are not a good indication of actual cache line movement. The Icache and Dcache miss and refill events (81h, 82h, 83h, 41h, 42h, 43h) provide a more accurate indication of this, and are the preferred way to measure such traffic.
Value | Unit mask description |
0 | IC fill |
1 | DC fill |
2 | TLB fill (page table walks) |
3 | Tag snoop request |
4 | Cancelled request |
5 | Hardware prefetch from DC |
Abbreviation: L2 misses
The number of requests that miss in the L2 cache. This may include some amount of speculative activity, as well as some amount of retried requests as described in event 7Dh. The IC-fill-miss and DC-fill-miss events tend to mirror the Icache and Dcache refill-from-system events (83h and 43h, respectively), and tend to include more speculative activity than those events.
Value | Unit mask description |
0 | IC fill |
1 | DC fill (includes possible replays) |
2 | TLB page table walk |
3 | Hardware prefetch from DC |
Abbreviation: L2 fill/writeback
The number of lines written into the L2 cache due to victim writebacks from the Icache or Dcache, TLB page table walks and the hardware data prefetcher (UNIT_MASK bit 0); or writebacks of dirty lines from the L2 to the system (UNIT_MASK bit 1). Each increment represents a 64-byte cache line transfer.
Note: Victim writebacks from the Dcache may be measured separately using event 44h. However this is not quite the same as the Dcache component of event 7Fh, the main difference being PrefetchNTA lines. When these are evicted from the Dcache due to replacement, they are written out to system memory (if dirty) or simply invalidated (if clean), rather than being moved to the L2 cache.
Value | Unit mask description |
0 | L2 fills |
1 | L2 writebacks to system |
Abbreviation: IC fetches
The number of instruction cache accesses by the instruction fetcher. Each access is an aligned 16-byte read, from which a varying number of instructions may be decoded.
Abbreviation: IC misses
The number of instruction fetches that miss in the instruction cache. This is typically equal to or very close to the sum of events 82h and 83h. Each miss results in a 64-byte cache line refill.
Abbreviation: IC refills from L2
The number of instruction cache refills satisfied from the L2 cache. Each increment represents one 64-byte cache line transfer.
Abbreviation: IC refills from sys
The number of instruction cache refills from system memory (or another cache). Each increment represents one 64-byte cache line transfer.
Abbreviation: ITLB L1M L2H
The number of instruction fetches that miss in the L1 ITLB but hit in the L2 ITLB.
Abbreviation: ITLB L1M L2M
The number of instruction fetches that miss in both the L1 and L2 ITLBs.
Value | Unit mask description |
0 | Instruction fetches to a 4K page |
1 | Instruction fetches to a 2M page |
Abbreviation: Restart i-stream probe
The number of pipeline restarts caused by invalidating probes that hit on the instruction stream currently being executed. This would happen if the active instruction stream were being modified by another processor in an MP system - a highly unlikely event.
Abbreviation: Inst fetch stall
The number of cycles the instruction fetcher is stalled. This may be for a variety of reasons such as branch predictor updates, unconditional branch bubbles, far jumps and cache misses, among others. May be overlapped by instruction dispatch stalls or instruction execution, such that these stalls don't necessarily impact performance.
Abbreviation: RET stack hits
The number of near return instructions (RET or RET Iw) that get their return address from the return address stack (i.e. where the stack has not gone empty). This may include cases where the address is incorrect (return mispredicts). This may also include speculatively executed false-path returns. Return mispredicts are typically caused by the return address stack underflowing; however, they may also be caused by an imbalance in calls vs. returns, such as doing a call but then popping the return address off the stack.
Note: This event cannot be reliably compared with events C9h and CAh (such as to calculate percentage of return mispredicts due to an empty return address stack), since it may include speculatively executed false-path returns that are not included in those retire-time events.
Abbreviation: RET stack overflows
The number of (near) call instructions that cause the return address stack to overflow. When this happens, the oldest entry is discarded. This count may include speculatively executed calls.
Abbreviation: IC victims
The number of cache lines evicted from the instruction cache to the L2.
Abbreviation: IC lines invalid
The number of instruction cache lines invalidated.
Value | Unit mask description |
0 | Non-SMC that missed on in-flight instructions |
1 | Non-SMC that hit on in-flight instructions |
2 | SMC that missed on in-flight instructions |
3 | SMC that hit on in-flight instructions |
Abbreviation: ITLB reloads
The number of ITLB reload requests.
Abbreviation: ITLB reloads aborted
The number of ITLB reloads aborted.
Abbreviation: CPU clocks
The number of clocks that the CPU is not in a halted state (due to STPCLK or a HALT instruction). Note: this event allows system idle time to be automatically factored out of IPC (or CPI) measurements, provided the OS halts the CPU when going idle. If the OS goes into an idle loop rather than halting, such calculations will be influenced by the IPC of the idle loop.
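A sketch of the IPC/CPI calculation this note refers to, assuming the retired-instruction count comes from a second counter programmed over the same interval:

```c
#include <stdint.h>

/* IPC with idle time factored out: retired instructions divided by
 * non-halted CPU clocks (this event), both measured over the same
 * interval. CPI is simply the reciprocal. */
static double ipc(uint64_t retired_instructions, uint64_t cpu_clocks)
{
    return cpu_clocks ? (double)retired_instructions / (double)cpu_clocks
                      : 0.0;
}
```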
Abbreviation: Ret inst
The number of instructions retired (execution completed and architectural state updated). This count includes exceptions and interrupts - each exception or interrupt is counted as one instruction.
Abbreviation: Ret uops
The number of micro-ops retired. This includes all processor activity (instructions, exceptions, interrupts, microcode assists, etc.).
Abbreviation: Ret branch
The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts.
Abbreviation: Ret misp branch
The number of branch instructions retired, of any type, that were not correctly predicted. This includes those for which prediction is not attempted (far control transfers, exceptions and interrupts).
Abbreviation: Ret taken branch
The number of taken branches that were retired. This includes all types of architectural control flow changes, including exceptions and interrupts.
Abbreviation: Ret taken branch misp
The number of retired taken branch instructions that were mispredicted.
Abbreviation: Ret far xfers
The number of far control transfers retired including far call/jump/return, IRET, SYSCALL and SYSRET, plus exceptions and interrupts. Far control transfers are not subject to branch prediction.
Abbreviation: Ret branch resyncs
The number of resync branches. These reflect pipeline restarts due to certain microcode assists and events such as writes to the active instruction stream, among other things. Each occurrence reflects a restart penalty similar to a branch mispredict. Relatively rare.
Abbreviation: Ret near RET
The number of near return instructions (RET or RET Iw) retired.
Abbreviation: Ret near RET misp
The number of near returns retired that were not correctly predicted by the return address predictor. Each such mispredict incurs the same penalty as a mispredicted conditional branch instruction.
Abbreviation: Ret ind branch misp
The number of indirect branch instructions retired where the target address was not correctly predicted.
Abbreviation: Ret MMX/FP inst
The number of MMX®, SSE or x87 instructions retired. The UNIT_MASK allows the selection of the individual classes of instructions as given in the table. Each increment represents one complete instruction.
Note: Since this event includes non-numeric instructions it is not suitable for measuring MFLOPS.
Value | Unit mask description |
0 | x87 instructions |
1 | MMX and 3DNow instructions |
2 | SSE and SSE2 instructions |
Abbreviation: Ret fastpath double op
Value | Unit mask description |
0 | With low op in position 0 |
1 | With low op in position 1 |
2 | With low op in position 2 |
Abbreviation: Int-masked cycles
The number of processor cycles where interrupts are masked (EFLAGS.IF = 0). Using edge-counting with this event will give the number of times IF is cleared; dividing the cycle-count value by this value gives the average length of time that interrupts are disabled on each instance. Compare the edge count with event CFh to determine how often interrupts are disabled for interrupt handling vs. other reasons (e.g. critical sections).
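A sketch of the average-duration calculation described above, assuming the same event is measured twice: once counting cycles and once with the Edge control bit set (counting 0-to-1 transitions of the condition):

```c
#include <stdint.h>

/* Average length, in cycles, of each interrupts-masked region:
 * cycle-mode count divided by edge-mode count of the same event. */
static double avg_masked_region_cycles(uint64_t masked_cycles,
                                       uint64_t masked_edges)
{
    return masked_edges ? (double)masked_cycles / (double)masked_edges
                        : 0.0;
}
```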
Abbreviation: Int-masked pending
The number of processor cycles where interrupts are masked (EFLAGS.IF = 0) and an interrupt is pending. Using edge-counting with this event and comparing the resulting count with the edge count for event CDh gives the proportion of interrupts for which handling is delayed due to prior interrupts being serviced, critical sections, etc. The cycle count value gives the total amount of time for such delays. The cycle count divided by the edge count gives the average length of each such delay.
Abbreviation: Int taken
The number of hardware interrupts taken. This does not include software interrupts (INT n instruction).
Abbreviation: Decoder empty
The number of processor cycles where the decoder has nothing to dispatch (typically waiting on an instruction fetch that missed the Icache, or for the target fetch after a branch mispredict).
Abbreviation: Dispatch stalls
The number of processor cycles where the decoder is stalled for any reason (has one or more instructions ready but can't dispatch them due to resource limitations in execution). This is the combined effect of events D2h - DAh, some of which may overlap; this event reflects the net stall cycles. The more common stall conditions (events D5h, D6h, D7h, D8h, and to a lesser extent D2h) may overlap considerably. The occurrence of these stalls is highly dependent on the nature of the code being executed (instruction mix, memory reference patterns, etc.).
Abbreviation: Stall branch abort
The number of processor cycles the decoder is stalled waiting for the pipe to drain after a mispredicted branch. This stall occurs if the corrected target instruction reaches the dispatch stage before the pipe has emptied. See also event D1h.
Abbreviation: Stall serialization
The number of processor cycles the decoder is stalled due to a serializing operation, which waits for the execution pipeline to drain. Relatively rare; mainly associated with system instructions. See also event D1h.
Abbreviation: Stall seg load
The number of processor cycles the decoder is stalled due to a segment load instruction being encountered while execution of a previous segment load operation is still pending. Relatively rare except in 16-bit code. See also event D1h.
Abbreviation: Stall reorder full
The number of processor cycles the decoder is stalled because the reorder buffer is full. May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall res station full
The number of processor cycles the decoder is stalled because a required integer unit reservation stations is full. May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall FPU full
The number of processor cycles the decoder is stalled because the scheduler for the Floating Point Unit is full. This condition can be caused by a lack of parallelism in FP-intensive code, or by cache misses on FP operand loads (which could also show up as event D8h instead, depending on the nature of the instruction sequences). May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall LS full
The number of processor cycles the decoder is stalled because the Load/Store Unit is full. This generally occurs due to heavy cache miss activity. May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall waiting quiet
The number of processor cycles the decoder is stalled waiting for all outstanding requests to the system to be resolved. Relatively rare; associated with certain system instructions and types of interrupts. May partially overlap certain other stall conditions; see event D1h.
Abbreviation: Stall far/resync
The number of processor cycles the decoder is stalled waiting for the execution pipeline to drain before dispatching the target instructions of a far control transfer or a Resync (an instruction stream restart associated with certain microcode assists). Relatively rare; does not overlap with other stall conditions. See also event D1h.
Abbreviation: FPU except
The number of floating point unit exceptions for microcode assists. The UNIT_MASK may be used to isolate specific types of exceptions.
Value | Unit mask description |
0 | x87 reclass microfaults |
1 | SSE retype microfaults |
2 | SSE reclass microfaults |
3 | SSE and x87 microtraps |
Abbreviation: DR0 matches
The number of matches on the address in breakpoint register DR0, per the breakpoint type specified in DR7. The breakpoint does not have to be enabled. Each instruction breakpoint match incurs an overhead of about 120 cycles; load/store breakpoint matches do not incur any overhead.
Abbreviation: DR1 matches
The number of matches on the address in breakpoint register DR1. See notes for event DCh.
Abbreviation: DR2 matches
The number of matches on the address in breakpoint register DR2. See notes for event DCh.
Abbreviation: DR3 matches
The number of matches on the address in breakpoint register DR3. See notes for event DCh.
Abbreviation: DRAM accesses
The number of memory accesses performed by the local DRAM controller. The UNIT_MASK may be used to isolate the different DRAM page access cases. Page miss cases incur an extra latency to open a page; page conflict cases incur both a page-close as well as page-open penalties. These penalties may be overlapped by DRAM accesses for other requests and don't necessarily represent lost DRAM bandwidth. The associated penalties are as follows:
Page miss: Trcd (DRAM RAS-to-CAS delay)
Page conflict: Trp + Trcd (DRAM row-precharge time plus RAS-to-CAS delay)
Each DRAM access represents one 64-byte block of data transferred if the DRAM is configured for 64-byte granularity, or one 32-byte block if the DRAM is configured for 32-byte granularity. (The latter is only applicable to single-channel DRAM systems, which may be configured either way.)
Value | Unit mask description |
0 | DCT0 Page hit |
1 | DCT0 Page miss |
2 | DCT0 Page conflict |
3 | DCT1 Page hit |
4 | DCT1 Page miss |
5 | DCT1 Page conflict |
6 | Write request |
7 | Read request |
Abbreviation: Page table overflows
The number of page table overflows in the local DRAM controller. This table maintains information about which DRAM pages are open. An overflow occurs when a request for a new page arrives when the maximum number of pages are already open. Each occurrence reflects an access latency penalty equivalent to a page conflict.
Value | Unit mask description |
0 | DCT0 page table overflow |
1 | DCT1 page table overflow |
Abbreviation: DRAM cmd slot miss
Value | Unit mask description |
0 | DCT0 command slots missed |
1 | DCT1 command slots missed |
Abbreviation: Turnarounds
The number of turnarounds on the local DRAM data bus. The UNIT_MASK may be used to isolate the different cases. These represent lost DRAM bandwidth, which may be calculated as follows (in bytes per occurrence):
DIMM turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * 2
Read-to-write turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * 1
Write-to-read turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * (Tcl-1)
where DRAM_width_in_bytes is 8 or 16 (for single- or dual-channel systems), and Tcl is the CAS latency of the DRAM in memory system clock cycles (where the memory clock for DDR-400, or PC3200 DIMMs, for example, would be 200 MHz).
Value | Unit mask description |
0 | DCT0 DIMM (chip select) turnaround |
1 | DCT0 Read to write turnaround |
2 | DCT0 Write to read turnaround |
3 | DCT1 DIMM (chip select) turnaround |
4 | DCT1 Read to write turnaround |
5 | DCT1 Write to read turnaround |
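A sketch of the lost-bandwidth formulas above; width_bytes (8 for single-channel, 16 for dual-channel) and tcl (CAS latency in memory clocks) are caller-supplied:

```c
/* Lost DRAM bytes per turnaround occurrence, per the formulas above.
 * width_bytes is 8 (single-channel) or 16 (dual-channel); tcl is the
 * DRAM CAS latency in memory-system clock cycles. */
static unsigned dimm_turnaround_bytes(unsigned width_bytes)
{
    return width_bytes * 2 /* edges per memclk */ * 2;
}

static unsigned read_to_write_bytes(unsigned width_bytes)
{
    return width_bytes * 2 * 1;
}

static unsigned write_to_read_bytes(unsigned width_bytes, unsigned tcl)
{
    return width_bytes * 2 * (tcl - 1);
}

/* Example: a dual-channel DDR-400 system (width 16, Tcl = 3) loses
 * 64 bytes per DIMM turnaround and 64 bytes per write-to-read
 * turnaround. */
```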
Abbreviation: Bypass ctr sat
Value | Unit mask description |
0 | Memory controller high priority bypass |
1 | Memory controller medium priority bypass |
2 | DCT0 DCQ bypass |
3 | DCT1 DCQ bypass |
Abbreviation: Thermal status
Value | Unit mask description |
2 | Number of times the HTC trip point is crossed |
3 | Number of clocks when STC trip point active |
4 | Number of times the STC trip point is crossed |
5 | Number of clocks HTC P-state is inactive |
6 | Number of clocks HTC P-state is active |
Abbreviation: CPU/IO req mem/IO
These events reflect request flow between units and nodes, as selected by the UNIT_MASK. The UNIT_MASK is divided into two fields: request type (CPU or I/O access to I/O or memory) and source/target location (local vs. remote). One or more request types must be enabled via bits 3:0, and at least one source and one target location must be selected via bits 7:4. Each event reflects a request of the selected type(s) going from the selected source(s) to the selected target(s).
Not all possible paths are supported. Valid UNIT_MASK values may be logically ORed to combine the events. For instance, local CPU requests to both local and remote nodes would be A8h | 98h = B8h. Any CPU to any I/O would be A4h | 94h | 64h = F4h (but remote CPU to remote I/O requests would not be included).
Note: It is not possible to tell from these events how much data is going in which direction, as there is no distinction between reads and writes. Also, particularly for I/O, the requests may be for varying amounts of data, anywhere from one to sixty-four bytes. Event E5h provides an indication of 32- and 64-byte read and write transfers for such requests (although from the target point of view). For a direct measure of the amount and direction of data flowing between nodes, use events F6h, F7h and F8h.
Value | Unit mask description |
0 | I/O to I/O |
1 | I/O to memory |
2 | CPU to I/O |
3 | CPU to Memory |
4 | To remote node |
5 | To local node |
6 | From remote node |
7 | From local node |
Abbreviation: Cache block cmd
The number of requests made to the system for cache line transfers or coherency state changes, by request type. Each increment represents one cache line transfer, except for Change-to-Dirty. If a Change-to-Dirty request hits on a line in another processor's cache that's in the Owned state, it will cause a cache line transfer, otherwise there is no data transfer associated with Change-to-Dirty requests.
Value | Unit mask description |
0 | Victim block (writeback) |
2 | Read block (Dcache load miss refill) |
3 | Read block shared (Icache refill) |
4 | Read block modified (Dcache store miss refill) |
5 | Change to Dirty (first store to clean block in cache) |
Abbreviation: Sized cmd
The number of Sized Read/Write commands handled by the System Request Interface (local processor and host bridge interface to the system). These commands may originate from the processor or host bridge. Typical uses of the various Sized Read/Write commands are given in the UNIT_MASK table. See also event E5h, which covers commonly-used block sizes for these requests, and event ECh, which provides a separate measure of Host bridge accesses.
Value | Unit mask description |
0 | NonPosted SzWr byte |
1 | NonPosted SzWr DWORD |
2 | Posted SzWr byte |
3 | Posted SzWr DWORD |
4 | SzRd byte |
5 | SzRd DWORD |
Abbreviation: Probe resp/up req
This covers two unrelated sets of events: cache probe results, and requests received by the Host bridge from devices on non-coherent links.
Probe results: These events reflect the results of probes sent from a memory controller to local caches. They provide an indication of the degree to which data and code are shared between processors (or moved between processors due to process migration). The dirty-hit events indicate the transfer of a 64-byte cache line to the requestor (for a read or cache refill) or the target memory (for a write). The system bandwidth used by these, in terms of bytes per unit of time, may be calculated as 64 times the event count, divided by the elapsed time. Sized writes to memory that cover a full cache line do not incur this cache line transfer - they simply invalidate the line and are reported as clean hits. Cache line transfers will occur for Change2Dirty requests that hit cache lines in the Owned state. (Such cache lines are counted as Modified-state refills for event 6Ch, System Read Responses.)
Upstream requests: The upstream read and write events reflect requests originating from a device on a local non-coherent HyperTransport™ link. The two read events allow display refresh traffic in a UMA system to be measured separately from other DMA activity. Display refresh traffic will typically be dominated by 64-byte transfers. Non-display-related DMA accesses may be anywhere from 1 to 64 bytes in size, but may be dominated by a particular size such as 32 or 64 bytes, depending on the nature of the devices. Event E5h can provide a measure of 32- and 64-byte accesses by the host bridge (possibly combined with write combining buffer flush activity from the processor, although that can be factored out via event 65h).
Value | Unit mask description |
0 | Probe miss |
1 | Probe hit clean |
2 | Probe hit dirty without memory cancel |
3 | Probe hit dirty with memory cancel |
4 | Upstream display refresh/ISOC reads |
5 | Upstream non-display refresh reads |
6 | Upstream ISOC writes |
7 | Upstream non-ISOC writes |
Abbreviation: GART events
These events reflect GART activity, and in particular allow the GART TLB miss ratio to be calculated as GART_miss_count divided by GART_aperture_hit_count. GART aperture accesses are typically from I/O devices as opposed to the processor (generally from a 3D graphics accelerator, but possibly from other devices when the GART is used as an IO MMU).
Value | Unit mask description |
0 | GART aperture hit on access from CPU |
1 | GART aperture hit on access from I/O |
2 | GART miss |
3 | GART/DEV Request hit table walk in progress |
4 | DEV hit |
5 | DEV miss |
6 | DEV error |
7 | GART/DEV multiple table walk in progress |
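A sketch of the GART TLB miss-ratio calculation described above, with the hit count taken as the sum of the CPU-side and I/O-side aperture hits:

```c
#include <stdint.h>

/* GART TLB miss ratio: GART misses divided by GART aperture hits
 * (CPU-access hits plus I/O-access hits), per the description above. */
static double gart_miss_ratio(uint64_t misses, uint64_t cpu_hits,
                              uint64_t io_hits)
{
    uint64_t hits = cpu_hits + io_hits;
    return hits ? (double)misses / (double)hits : 0.0;
}
```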
Abbreviation: MCT Requests
Read/Write requests: The read/write request events reflect the total number of commands sent to the DRAM controller.
Sized Read/Write activity: The Sized Read/Write events reflect 32- or 64-byte transfers (as opposed to other sizes which could be anywhere between 1 and 64 bytes), from either the processor or the Host bridge (on any node in an MP system). Such accesses from the processor would be due only to write combining buffer flushes, where 32-byte accesses would reflect flushes of partially-filled buffers. Event 65h provides a count of sized write requests associated with WC buffer flushes; comparing that with counts for these events (provided there is very little Host bridge activity at the same time) gives an indication of how efficiently the write combining buffers are being used. Event 65h may also be useful in factoring out WC flushes when comparing these events with the Upstream Requests component of event ECh.
Value | Unit mask description |
0 | Write requests sent to the DCT |
1 | Read requests (including prefetch requests) sent to the DCT |
2 | Prefetch requests sent to the DCT |
3 | 32 Bytes Sized Writes |
4 | 64 Bytes Sized Writes |
5 | 32 Bytes Sized Reads |
6 | 64 Byte Sized Reads |
Abbreviation: CPU to DRAM req to node
This event counts all DRAM reads and writes generated by cores on the local node to the targeted node in the coherent fabric. This counter can be used to observe processor data affinity in NUMA aware operating systems.
Value | Unit mask description |
0 | From Local node to Node 0 |
1 | From Local node to Node 1 |
2 | From Local node to Node 2 |
3 | From Local node to Node 3 |
4 | From Local node to Node 4 |
5 | From Local node to Node 5 |
6 | From Local node to Node 6 |
7 | From Local node to Node 7 |
Abbreviation: IO to DRAM req to node
This event counts all DRAM reads and writes generated by IO devices attached to the IO links of the local node to the targeted node in the coherent fabric. This counter can be used to observe IO device data affinity in NUMA aware operating systems.
Value | Unit mask description |
0 | From Local node to Node 0 |
1 | From Local node to Node 1 |
2 | From Local node to Node 2 |
3 | From Local node to Node 3 |
4 | From Local node to Node 4 |
5 | From Local node to Node 5 |
6 | From Local node to Node 6 |
7 | From Local node to Node 7 |
Abbreviation: CPU read cmd lat to node 0-3
This event counts the number of NB clocks from when the targeted command is received by the NB to when it completes. This event only tracks one outstanding command at a time. To determine the latency between the local node and a remote node, set UnitMask[7:4] to select the node and UnitMask[3:0] to select the read command type. The count returned by this counter should be divided by the count returned by EventSelect 1E3h to determine the average latency for the command type.
Value | Unit mask description |
0 | Read block |
1 | Read block shared |
2 | Read block modified |
3 | Change to Dirty |
4 | From Local node to Node 0 |
5 | From Local node to Node 1 |
6 | From Local node to Node 2 |
7 | From Local node to Node 3 |
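A sketch of the average-latency calculation, dividing this event's cycle count by the matching request count from EventSelect 1E3h (both counters programmed with identical unit masks):

```c
#include <stdint.h>

/* Average Northbridge read latency, in NB clocks, for one command type
 * and target node: latency clocks (this event) divided by the request
 * count from EventSelect 1E3h. */
static double avg_nb_read_latency(uint64_t latency_clocks,
                                  uint64_t requests)
{
    return requests ? (double)latency_clocks / (double)requests : 0.0;
}
```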
Abbreviation: CPU read cmd req to node 0-3
This event counts the number of requests for which a latency measurement is made using EventSelect 1E2h. To determine the number of commands measured between the local node and a remote node, set UnitMask[7:4] to select the node and UnitMask[3:0] to select the read command type.
Value | Unit mask description |
0 | Read block |
1 | Read block shared |
2 | Read block modified |
3 | Change to Dirty |
4 | From Local node to Node 0 |
5 | From Local node to Node 1 |
6 | From Local node to Node 2 |
7 | From Local node to Node 3 |
Abbreviation: CPU read cmd lat to node 4-7
This event counts the number of NB clocks from when the targeted command is received by the NB to when it completes. This event only tracks one outstanding command at a time. To determine the latency between the local node and a remote node, set UnitMask[7:4] to select the node and UnitMask[3:0] to select the read command type. The count returned by this counter should be divided by the count returned by EventSelect 1E5h to determine the average latency for the command type.
Value | Unit mask description |
0 | Read block |
1 | Read block shared |
2 | Read block modified |
3 | Change to Dirty |
4 | From Local node to Node 4 |
5 | From Local node to Node 5 |
6 | From Local node to Node 6 |
7 | From Local node to Node 7 |
Abbreviation: CPU read cmd req to node 4-7
This event counts the number of requests for which a latency measurement is made using EventSelect 1E4h. To determine the number of commands measured between the local node and a remote node, set UnitMask[7:4] to select the node and UnitMask[3:0] to select the read command type.
Value | Unit mask description |
0 | Read block |
1 | Read block shared |
2 | Read block modified |
3 | Change to Dirty |
4 | From Local node to Node 4 |
5 | From Local node to Node 5 |
6 | From Local node to Node 6 |
7 | From Local node to Node 7 |
Abbreviation: CPU cmd lat to node 0-3/4-7
This event counts the number of NB clocks from when the targeted command is received by the NB to when it completes. This event only tracks one outstanding command at a time. To determine the latency between the local node and a remote node, set UnitMask[7:4] to select the node, UnitMask[3] to select the node group, and UnitMask[2:0] to select the command type. The count returned by this counter should be divided by the count returned by EventSelect 1E7h to determine the average latency for the command type.
Value | Unit mask description |
0 | Read Sized |
1 | Write Sized |
2 | Victim Block |
3 | Node Group Select: 0=Nodes 0-3, 1=Nodes 4-7 |
4 | From Local node to Node 0/4 |
5 | From Local node to Node 1/5 |
6 | From Local node to Node 2/6 |
7 | From Local node to Node 3/7 |
Abbreviation: CPU req to node 0-3/4-7
This event counts the number of requests for which a latency measurement is made using EventSelect 1E6h. To determine the number of commands measured between the local node and a remote node, set UnitMask[7:4] to select the node, UnitMask[3] to select the node group, and UnitMask[2:0] to select the command type.
Value | Unit mask description |
0 | Read Sized |
1 | Write Sized |
2 | Victim Block |
3 | Node Group Select: 0=Nodes 0-3, 1= Nodes 4-7 |
4 | From Local node to Node 0/4 |
5 | From Local node to Node 1/5 |
6 | From Local node to Node 2/6 |
7 | From Local node to Node 3/7 |
Abbreviation: HT0 bandwidth
The number of Dwords transmitted (or unused, in the case of Nops) on the outgoing side of the HyperTransport link. The sum of the Command, Address extension, Data, Buffer release and Nop subevents (UNIT_MASK 1Fh) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Address extension, Data and Buffer release count (UNIT_MASK 0Fh) by that count plus the Nop count (UNIT_MASK 10h). Bandwidth in terms of bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by elapsed time.
The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e. which links connect to which nodes).
UnitMask[7] specifies the sublink to count if the link is unganged.
Value | Unit mask description |
0 | Command DWORD sent |
1 | Address extension DWORD sent |
2 | Data DWORD sent |
3 | Buffer release DWORD sent |
4 | Nop DW sent (idle) |
5 | Per packet CRC sent |
7 | Sublink Mask |
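A sketch of the utilization and bandwidth calculations above; useful_dwords is assumed to be the combined Command, Address extension, Data and Buffer release count (UNIT_MASK 0Fh) and nop_dwords the Nop count (UNIT_MASK 10h):

```c
#include <stdint.h>

/* Link utilization: useful DWORDs divided by useful plus Nop DWORDs. */
static double ht_link_utilization(uint64_t useful_dwords,
                                  uint64_t nop_dwords)
{
    uint64_t total = useful_dwords + nop_dwords;
    return total ? (double)useful_dwords / (double)total : 0.0;
}

/* Bandwidth in bytes per second for any subevent combination: each
 * DWORD is four bytes. */
static double ht_bytes_per_second(uint64_t dwords, double elapsed_seconds)
{
    return (double)dwords * 4.0 / elapsed_seconds;
}
```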
Abbreviation: HT1 bandwidth
The number of Dwords transmitted (or unused, in the case of Nops) on the outgoing side of the HyperTransport link. The sum of the Command, Address extension, Data, Buffer release and Nop subevents (UNIT_MASK 1Fh) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Address extension, Data and Buffer release count (UNIT_MASK 0Fh) by that count plus the Nop count (UNIT_MASK 10h). Bandwidth in terms of bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by elapsed time.
The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e. which links connect to which nodes).
UnitMask[7] specifies the sublink to count if the link is unganged.
Value | Unit mask description |
0 | Command DWORD sent |
1 | Address extension DWORD sent |
2 | Data DWORD sent |
3 | Buffer release DWORD sent |
4 | Nop DW sent (idle) |
5 | Per packet CRC sent |
7 | Sublink Mask |
Abbreviation: HT2 bandwidth
The number of Dwords transmitted (or unused, in the case of Nops) on the outgoing side of the HyperTransport link. The sum of the Command, Address extension, Data, Buffer release and Nop subevents (UNIT_MASK 1Fh) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Address extension, Data and Buffer release count (UNIT_MASK 0Fh) by that count plus the Nop count (UNIT_MASK 10h). Bandwidth in terms of bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by elapsed time.
The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e., which links connect to which nodes).
UnitMask[7] specifies the sublink to count if the link is unganged.
Value | Unit mask description |
0 | Command DWORD sent |
1 | Address extension DWORD sent |
2 | Data DWORD sent |
3 | Buffer release DWORD sent |
4 | Nop DW sent (idle) |
5 | Per packet CRC sent |
7 | Sublink Mask |
Abbreviation: HT3 bandwidth
The number of Dwords transmitted (or unused, in the case of Nops) on the outgoing side of the HyperTransport link. The sum of the Command, Address extension, Data, Buffer release and Nop subevents (UNIT_MASK 1Fh) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Address extension, Data and Buffer release count (UNIT_MASK 0Fh) by that count plus the Nop count (UNIT_MASK 10h). Bandwidth in terms of bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by elapsed time.
The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e. which links connect to which nodes).
UnitMask[7] specifies the sublink to count if the link is unganged.
Value | Unit mask description |
0 | Command DWORD sent |
1 | Address extension DWORD sent |
2 | Data DWORD sent |
3 | Buffer release DWORD sent |
4 | Nop DW sent (idle) |
5 | Per packet CRC sent |
7 | Sublink Mask |
Abbreviation: Read req L3
This event tracks the read requests from each core to the L3 cache, including read requests that are cancelled. The core tracked is selected using UnitMask[7:4]. One or more cores must be selected. To determine the total number of read requests from one core, select only a single core using UnitMask[7:4] and set UnitMask[2:0] to 111b.
Value | Unit mask description |
0 | Read Block Exclusive (Data cache read) |
1 | Read Block Shared (Instruction cache read) |
2 | Read Block Modify |
4 | Core 0 Select |
5 | Core 1 Select |
6 | Core 2 Select |
7 | Core 3 Select |
Abbreviation: L3 misses
This event counts the number of L3 cache misses for accesses from each core. The core tracked is selected using UnitMask[7:4]. One or more cores must be selected. To determine the total number of cache misses from one core, select only a single core using UnitMask[7:4] and set UnitMask[2:0] to 111b. The approximate number of L3 hits can be determined by subtracting this event's count from the count of EventSelect 4E0h.
Value | Unit mask description |
0 | Read Block Exclusive (Data cache read) |
1 | Read Block Shared (Instruction cache read) |
2 | Read Block Modify |
4 | Core 0 Select |
5 | Core 1 Select |
6 | Core 2 Select |
7 | Core 3 Select |
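A sketch of the hit-rate estimate described above, assuming both events are programmed with the same core-select and request-type unit mask bits:

```c
#include <stdint.h>

/* Approximate L3 hit rate for one core: (requests - misses) / requests,
 * using the read request event (EventSelect 4E0h) and this miss event. */
static double l3_hit_rate(uint64_t l3_requests, uint64_t l3_misses)
{
    return l3_requests
        ? (double)(l3_requests - l3_misses) / (double)l3_requests
        : 0.0;
}
```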
Abbreviation: L3 fills
This event counts the number of L3 fills caused by L2 evictions. The core tracked is selected using UnitMask[7:4]. One or more cores must be selected.
Value | Unit mask description |
0 | Shared |
1 | Exclusive |
2 | Owned |
3 | Modified |
4 | Core 0 Select |
5 | Core 1 Select |
6 | Core 2 Select |
7 | Core 3 Select |
Abbreviation: L3 evictions
This event counts the state of the L3 lines when they are evicted from the L3 cache.
Value | Unit mask description |
0 | Shared |
1 | Exclusive |
2 | Owned |
3 | Modified |