Why use performance counters?
Performance counters help locate bottlenecks in real applications. They report accurate metrics on how busy or idle the GPU is.
How to use performance counters?
GALLIUM_HUD
GALLIUM_HUD is an environment variable in Mesa that displays performance counters in an on-screen overlay.
To list the available queries (which vary by chipset), run 'GALLIUM_HUD="help" glxgears'.
Then pick the queries you want to monitor, for example 'inst_executed', and start monitoring with 'GALLIUM_HUD="inst_executed" glxgears'.
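The same invocation can be scripted. The following is a minimal sketch, not part of Mesa: the helper names hud_env and run_with_hud are hypothetical, and joining several query names with commas (one graph per query) is an assumption that should be checked against the output of GALLIUM_HUD="help".

```python
import os
import subprocess


def hud_env(queries, base_env=None):
    """Return a copy of the environment with GALLIUM_HUD set.

    `queries` is a list of query names as printed by GALLIUM_HUD="help".
    Joining them with commas is an assumption of this sketch.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GALLIUM_HUD"] = ",".join(queries)
    return env


def run_with_hud(command, queries):
    """Launch `command` (an argv list) with the HUD showing `queries`."""
    return subprocess.run(command, env=hud_env(queries), check=False)
```

Calling run_with_hud(["glxgears"], ["inst_executed"]) should behave like the shell command given above.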
apitrace
apitrace is a tool for tracing graphics APIs such as OpenGL. It can also replay a trace and monitor performance counters per frame or per draw call.
To list the available queries, run 'glretrace --list-metrics'.
Then pick the queries you want to monitor and replay the trace with, for example, 'glretrace --pframes=GL_AMD_performance_monitor:inst_executed'.
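For repeated runs it can help to build the replay command in a script. This is only a sketch: the helper name pframes_cmd and the trace file name are hypothetical, and passing the trace file as the last positional argument is an assumption about how glretrace is normally invoked.

```python
import subprocess


def pframes_cmd(trace, metric, backend="GL_AMD_performance_monitor"):
    """Build the argv list for a per-frame profiling replay of `trace`.

    `metric` is one query name taken from 'glretrace --list-metrics'.
    """
    return ["glretrace", f"--pframes={backend}:{metric}", trace]


def replay(trace, metric):
    """Run the replay and return the completed process (output is captured)."""
    return subprocess.run(pframes_cmd(trace, metric),
                          capture_output=True, text=True, check=False)
```

For example, pframes_cmd("app.trace", "inst_executed") yields the same option as the command quoted above, followed by the (hypothetical) trace file name.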
Global perf counters
These performance counters are global. They are configured through PCOUNTER by writing values directly to MMIO. The kernel-space interface is already merged, but the user-space one is still a work in progress (WIP).
Status
Hardware events | NV50 |
geom_primitive_in_count | WIP |
geom_primitive_out_count | WIP |
geom_vertex_in_count | WIP |
geom_vertex_out_count | WIP |
gld_128b | WIP |
gld_32b | WIP |
gld_64b | WIP |
gld_coherent | WIP |
gld_incoherent | WIP |
gld_request | WIP |
gld_total | WIP |
gpu_idle | WIP |
gst_128b | WIP |
gst_32b | WIP |
gst_64b | WIP |
gst_coherent | WIP |
gst_incoherent | WIP |
gst_request | WIP |
gst_total | WIP |
input_assembler_waits_for_fb | WIP |
input_assembler_busy | WIP |
local_load | WIP |
local_store | WIP |
rasterizer_tiles_in_count | WIP |
rasterizer_tiles_killed_by_zcull_count | WIP |
rop_busy | WIP |
rop_samples_killed_by_earlyz_count | WIP |
rop_samples_killed_by_latez_count | WIP |
rop_waits_for_fb | WIP |
rop_waits_for_shader | WIP |
setup_line_count | WIP |
setup_point_count | WIP |
setup_primitive_count | WIP |
setup_primitive_culled_count | WIP |
setup_triangle_count | WIP |
stream_out_busy | WIP |
tex_cache_hit | WIP |
tex_cache_miss | WIP |
tex_waits_for_fb | WIP |
vertex_attribute_count | WIP |
MP perf counters
These performance counters are per-context. They are configured through the command stream, and a compute shader reads the values back (i.e. from the $pm0..$pm7 special registers, or sregs).
Status
Hardware events | SM20 (notes 1, 2) | SM21 (note 2) | SM30 | SM35 | SM50 |
active_cycles | DONE | DONE | DONE | DONE | DONE |
active_ctas | N/A | N/A | N/A | N/A | DONE |
active_warps | DONE | DONE | WIP | WIP | DONE |
atom_cas_count | N/A | N/A | DONE | DONE | N/A |
atom_count | DONE | DONE | DONE | DONE | DONE |
branch/divergent_branch | DONE | DONE | DONE | N/A | DONE |
{gld,gst}_request | DONE | DONE | DONE | DONE | N/A |
global_atom_cas | N/A | N/A | N/A | N/A | DONE |
global_{load,store} | N/A | N/A | N/A | N/A | DONE |
global_{ld,st}_mem_divergence_replays | N/A | N/A | DONE | DONE | N/A |
global_store_transaction | TODO | TODO | DONE | DONE | N/A |
gred_count | DONE | DONE | DONE | DONE | DONE |
inst_executed | DONE | DONE | DONE | DONE | DONE |
inst_issued (and variants) | DONE | DONE | DONE | DONE | DONE |
l1_global_load_{hit,miss} | TODO | TODO | DONE | DONE | N/A |
__l1_global_{load,store}_transactions | N/A | N/A | DONE | DONE | N/A |
l1_local_{load,store}_{hit,miss} | TODO | TODO | DONE | DONE | N/A |
l1_shared_{load,store}_transactions | N/A | N/A | DONE | DONE | N/A |
local_{load,store} | DONE | DONE | DONE | DONE | DONE |
local_{load,store}_transactions | N/A | N/A | DONE | DONE | N/A |
not_predicated_off_thread_inst_executed | N/A | N/A | N/A | DONE | DONE |
prof_trigger_{00-07} | DONE | DONE | DONE | DONE | DONE |
shared_atom | N/A | N/A | N/A | N/A | DONE |
shared_atom_cas | N/A | N/A | N/A | N/A | DONE |
shared_{ld,st}_transactions | N/A | N/A | N/A | N/A | DONE |
shared_{load,store} | DONE | DONE | DONE | DONE | DONE |
shared_{load,store}_bank_conflict | N/A | N/A | N/A | N/A | DONE |
shared_{load,store}_replay | N/A | N/A | DONE | DONE | N/A |
sm_cta_launched | TODO | TODO | DONE | DONE | DONE |
thread_inst_executed (and variants) | DONE | DONE | N/A | DONE | DONE |
threads_launched | DONE | DONE | DONE | DONE | N/A |
uncached_global_load_transaction | TODO | TODO | DONE | DONE | N/A |
warps_launched | DONE | DONE | DONE | DONE | DONE |
Notes
1 MP perf counters on GF100/GF110 (SM20) are unreliable because of a context-switch problem that still needs to be fixed. Results may be wrong, so treat them with care.
2 TODO means that those perf counters are exposed through PCOUNTER.
Metrics
Status
Name | SM20 | SM21 | SM30 | SM35 | SM50 |
achieved_occupancy | DONE | DONE | DONE | DONE | DONE |
alu_fu_utilization | TODO | TODO | TODO | TODO | TODO |
atomic_replay_overhead | TODO | TODO | TODO | TODO | TODO |
atomic_throughput | TODO | TODO | TODO | TODO | TODO |
atomic_transactions | TODO | TODO | TODO | TODO | TODO |
atomic_transactions_per_request | TODO | TODO | TODO | TODO | TODO |
branch_efficiency | DONE | DONE | DONE | N/A | DONE |
cf_executed | TODO | TODO | TODO | TODO | TODO |
cf_fu_utilization | TODO | TODO | TODO | TODO | TODO |
cf_issued | TODO | TODO | TODO | TODO | TODO |
dram_read_throughput | TODO | TODO | TODO | TODO | TODO |
dram_read_transactions | TODO | TODO | TODO | TODO | TODO |
dram_utilization | TODO | TODO | TODO | TODO | TODO |
dram_write_throughput | TODO | TODO | TODO | TODO | TODO |
dram_write_transactions | TODO | TODO | TODO | TODO | TODO |
eligible_warps_per_cycle | TODO | TODO | TODO | TODO | TODO |
flop_count_dp | TODO | TODO | TODO | TODO | TODO |
flop_count_dp_add | TODO | TODO | TODO | TODO | TODO |
flop_count_dp_fma | TODO | TODO | TODO | TODO | TODO |
flop_count_dp_mul | TODO | TODO | TODO | TODO | TODO |
flop_count_sp | TODO | TODO | TODO | TODO | TODO |
flop_count_sp_add | TODO | TODO | TODO | TODO | TODO |
flop_count_sp_fma | TODO | TODO | TODO | TODO | TODO |
flop_count_sp_mul | TODO | TODO | TODO | TODO | TODO |
flop_count_sp_special | TODO | TODO | TODO | TODO | TODO |
flop_dp_efficiency | TODO | TODO | TODO | TODO | TODO |
flop_sp_efficiency | TODO | TODO | TODO | TODO | TODO |
gld_efficiency | TODO | TODO | TODO | TODO | TODO |
gld_requested_throughput | TODO | TODO | TODO | TODO | TODO |
gld_throughput | TODO | TODO | TODO | TODO | TODO |
gld_transactions | TODO | TODO | TODO | TODO | TODO |
gld_transactions_per_request | TODO | TODO | TODO | TODO | TODO |
global_cache_replay_overhead | TODO | TODO | TODO | TODO | TODO |
gst_efficiency | TODO | TODO | TODO | TODO | TODO |
gst_requested_throughput | TODO | TODO | TODO | TODO | TODO |
gst_throughput | TODO | TODO | TODO | TODO | TODO |
gst_transactions | TODO | TODO | TODO | TODO | TODO |
gst_transactions_per_request | TODO | TODO | TODO | TODO | TODO |
inst_bit_convert | TODO | TODO | TODO | TODO | TODO |
inst_compute_ld_st | TODO | TODO | TODO | TODO | TODO |
inst_control | TODO | TODO | TODO | TODO | TODO |
inst_executed | TODO | TODO | TODO | TODO | TODO |
inst_fp_32 | TODO | TODO | TODO | TODO | TODO |
inst_fp_64 | TODO | TODO | TODO | TODO | TODO |
inst_integer | TODO | TODO | TODO | TODO | TODO |
inst_inter_thread_communication | TODO | TODO | TODO | TODO | TODO |
inst_issued | N/A | DONE | DONE | DONE | DONE |
inst_misc | TODO | TODO | TODO | TODO | TODO |
inst_per_warp | DONE | DONE | DONE | DONE | DONE |
inst_replay_overhead | DONE | DONE | DONE | DONE | DONE |
ipc | DONE | DONE | DONE | DONE | DONE |
issued_ipc | DONE | DONE | DONE | DONE | DONE |
issue_slots | N/A | DONE | DONE | DONE | DONE |
issue_slot_utilization | DONE | DONE | DONE | DONE | DONE |
l1_cache_global_hit_rate | TODO | TODO | TODO | TODO | TODO |
l1_cache_local_hit_rate | TODO | TODO | TODO | TODO | TODO |
l1_shared_utilization | TODO | TODO | TODO | TODO | TODO |
l2_atomic_throughput | TODO | TODO | TODO | TODO | TODO |
l2_atomic_transactions | TODO | TODO | TODO | TODO | TODO |
l2_l1_read_hit_rate | TODO | TODO | TODO | TODO | TODO |
l2_l1_read_throughput | TODO | TODO | TODO | TODO | TODO |
l2_l1_read_transactions | TODO | TODO | TODO | TODO | TODO |
l2_l1_write_throughput | TODO | TODO | TODO | TODO | TODO |
l2_l1_write_transactions | TODO | TODO | TODO | TODO | TODO |
l2_read_throughput | TODO | TODO | TODO | TODO | TODO |
l2_read_transactions | TODO | TODO | TODO | TODO | TODO |
l2_tex_read_transactions | TODO | TODO | TODO | TODO | TODO |
l2_texture_read_hit_rate | TODO | TODO | TODO | TODO | TODO |
l2_texture_read_throughput | TODO | TODO | TODO | TODO | TODO |
l2_utilization | TODO | TODO | TODO | TODO | TODO |
l2_write_throughput | TODO | TODO | TODO | TODO | TODO |
l2_write_transactions | TODO | TODO | TODO | TODO | TODO |
ldst_executed | TODO | TODO | TODO | TODO | TODO |
ldst_fu_utilization | TODO | TODO | TODO | TODO | TODO |
ldst_issued | TODO | TODO | TODO | TODO | TODO |
local_load_throughput | TODO | TODO | TODO | TODO | TODO |
local_load_transactions | TODO | TODO | TODO | TODO | TODO |
local_load_transactions_per_request | TODO | TODO | TODO | TODO | TODO |
local_memory_overhead | TODO | TODO | TODO | TODO | TODO |
local_replay_overhead | TODO | TODO | TODO | TODO | TODO |
local_store_throughput | TODO | TODO | TODO | TODO | TODO |
local_store_transactions | TODO | TODO | TODO | TODO | TODO |
local_store_transactions_per_request | TODO | TODO | TODO | TODO | TODO |
shared_efficiency | TODO | TODO | TODO | TODO | TODO |
shared_load_throughput | TODO | TODO | TODO | TODO | TODO |
shared_load_transactions | TODO | TODO | TODO | TODO | TODO |
shared_load_transactions_per_request | TODO | TODO | TODO | TODO | TODO |
shared_replay_overhead | N/A | N/A | DONE | DONE | TODO |
shared_store_throughput | TODO | TODO | TODO | TODO | TODO |
shared_store_transactions | TODO | TODO | TODO | TODO | TODO |
shared_store_transactions_per_request | TODO | TODO | TODO | TODO | TODO |
sm_efficiency | TODO | TODO | TODO | TODO | TODO |
stall_exec_dependency | TODO | TODO | TODO | TODO | TODO |
stall_inst_fetch | TODO | TODO | TODO | TODO | TODO |
stall_memory_dependency | TODO | TODO | TODO | TODO | TODO |
stall_memory_throttle | TODO | TODO | TODO | TODO | TODO |
stall_other | TODO | TODO | TODO | TODO | TODO |
stall_pipe_busy | TODO | TODO | TODO | TODO | TODO |
stall_sync | TODO | TODO | TODO | TODO | TODO |
stall_texture | TODO | TODO | TODO | TODO | TODO |
sysmem_read_throughput | TODO | TODO | TODO | TODO | TODO |
sysmem_read_transactions | TODO | TODO | TODO | TODO | TODO |
sysmem_utilization | TODO | TODO | TODO | TODO | TODO |
sysmem_write_throughput | TODO | TODO | TODO | TODO | TODO |
sysmem_write_transactions | TODO | TODO | TODO | TODO | TODO |
tex_cache_hit_rate | TODO | TODO | TODO | TODO | TODO |
tex_cache_throughput | TODO | TODO | TODO | TODO | TODO |
tex_cache_transactions | TODO | TODO | TODO | TODO | TODO |
tex_fu_utilization | TODO | TODO | TODO | TODO | TODO |
tex_utilization | TODO | TODO | TODO | TODO | TODO |
warp_execution_efficiency | TODO | TODO | DONE | DONE | DONE |
warp_nonpred_execution_efficiency | N/A | N/A | N/A | DONE | DONE |
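The table does not say how each metric is obtained. As an illustration only, and assuming the metrics follow the usual definitions used by NVIDIA's profiling tools, a few of the DONE entries can be derived from the MP hardware events listed earlier. The function names and the dictionary layout below are hypothetical, and the formulas are assumptions rather than the driver's actual code.

```python
def ratio(numerator, denominator):
    """Divide two counter values, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ipc(ev):
    """Instructions executed per active cycle."""
    return ratio(ev["inst_executed"], ev["active_cycles"])


def inst_per_warp(ev):
    """Average number of instructions executed by each launched warp."""
    return ratio(ev["inst_executed"], ev["warps_launched"])


def inst_replay_overhead(ev):
    """Extra issued instructions (replays) per executed instruction."""
    return ratio(ev["inst_issued"] - ev["inst_executed"], ev["inst_executed"])


def branch_efficiency(ev):
    """Percentage of branches that did not diverge."""
    return ratio(100.0 * (ev["branch"] - ev["divergent_branch"]), ev["branch"])
```

Each function takes a dictionary that maps event names to raw counter values, so the same sample can feed several metrics.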