Profiling Howto
To choose an appropriate propagation method and parameters for a given problem, it is essential to benchmark and profile the propagation.
Consider the following simple example:
using QuantumControlTestUtils.RandomObjects: random_dynamic_generator, random_state_vector
tlist = collect(range(0, step=1.0, length=101));
N = 200; # size of Hilbert space
H = random_dynamic_generator(N, tlist);
Ψ₀ = random_state_vector(N);
BenchmarkTools
The first line of defense is the use of BenchmarkTools
. The @benchmark
macro allows to generate statistics on how long a call to propagate
takes.
Chebychev propagation
For example, we can time the propagation with the Chebychev method:
using BenchmarkTools
using QuantumPropagators
using QuantumPropagators: Cheby
@benchmark propagate($Ψ₀, $H, $tlist; method=Cheby, check=false) samples=10
BenchmarkTools.Trial: 10 samples with 1 evaluation.
Range (min … max): 27.866 ms … 30.067 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 28.081 ms ┊ GC (median): 0.00%
Time (mean ± σ): 28.409 ms ± 723.180 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁ █ ▁▁ ▁ ▁ ▁
██▁█▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
27.9 ms Histogram: frequency by time 30.1 ms <
Memory estimate: 1.01 MiB, allocs estimate: 3034.
Newton propagation
Or, the same propagation with the Newton method:
using QuantumPropagators: Newton
@benchmark propagate($Ψ₀, $H, $tlist; method=Newton, check=false) samples=10
BenchmarkTools.Trial: 10 samples with 1 evaluation.
Range (min … max): 94.536 ms … 100.438 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 95.568 ms ┊ GC (median): 0.00%
Time (mean ± σ): 96.639 ms ± 2.142 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ █ ▁ ▁ ▁ ▁ ▁ ▁
█▁▁█▁█▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁█ ▁
94.5 ms Histogram: frequency by time 100 ms <
Memory estimate: 17.43 MiB, allocs estimate: 54267.
The result in this case illustrates the significant advantage of the Chebychev method for systems of moderate to small size and unitary dynamics.
When using custom data structures for the dynamical generators or states, @benchmark
should also be used to optimize lower-level operations as much as possible, e.g. the application of the Hamiltonian to the state.
TimerOutputs
A lot more insight into the internals of a propagate
call can be obtained by collecting timing data. This functionality is integrated in QuantumPropagators
and uses the TimerOutputs
package internally.
Enabling the collection of timing data
To enable collecting internal timing data, call QuantumPropagators.enable_timings
:
QuantumPropagators.enable_timings()
The status of the data collection can be verified with QuantumPropagators.timings_enabled
.
Chebychev propagation
Since the call to QuantumPropagators.enable_timings
invalidates existing compiled code, and to avoid the compilation overhead showing up in the timing data, we call propagate
once to ensure compilation:
propagate(Ψ₀, H, tlist; method=Cheby);
In any subsequent propagation, we could access the timing data in a callback
to propagate
:
function show_timing_data(propagator, args...)
if propagator.t == tlist[end]
show(propagator.timing_data, compact=true)
end
end
propagate(Ψ₀, H, tlist; method=:cheby, callback=show_timing_data);
───────────────────────────────────────────────────────────────────
Time Allocations
─────────────── ───────────────
Total measured: 289ms 19.7MiB
Section ncalls time %tot alloc %tot
───────────────────────────────────────────────────────────────────
prop_step! 100 28.2ms 100.0% 134KiB 100.0%
matrix-vector product 1.30k 26.5ms 93.8% 0.00B 0.0%
───────────────────────────────────────────────────────────────────
See the TimerOutputs
documentation for details on how to print the timing_data
.
Alternatively, without a callback
:
propagator = init_prop(Ψ₀, H, tlist; method=Cheby)
for step ∈ 1:(length(tlist)-1)
prop_step!(propagator)
end
show(propagator.timing_data, compact=true)
───────────────────────────────────────────────────────────────────
Time Allocations
─────────────── ───────────────
Total measured: 52.1ms 416KiB
Section ncalls time %tot alloc %tot
───────────────────────────────────────────────────────────────────
prop_step! 100 28.6ms 100.0% 134KiB 100.0%
matrix-vector product 1.30k 26.9ms 94.1% 0.00B 0.0%
───────────────────────────────────────────────────────────────────
The reported runtimes here are less important than the number of function calls and the runtime percentages. In this case, the timing data shows that the propagation is dominated by the matrix-vector products (applying the Hamiltonian to the state), as it should. The percentage would go to 100% for larger Hilbert spaces.
Newton propagation
For the Newton method:
propagate(Ψ₀, H, tlist; method=Newton); # recompilation
propagate(Ψ₀, H, tlist; method=Newton, callback=show_timing_data);
───────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────── ───────────────
Total measured: 142ms 19.0MiB
Section ncalls time %tot alloc %tot
───────────────────────────────────────────────────────────────────────────
prop_step! 100 100ms 100.0% 17.4MiB 100.0%
arnoldi! 200 46.2ms 46.4% 16.4KiB 0.1%
matrix-vector product 2.00k 42.9ms 43.0% 0.00B 0.0%
diagonalize_hessenberg_matrix 200 26.3ms 26.3% 9.76MiB 56.2%
get Leja points 200 23.2ms 23.3% 6.25KiB 0.0%
evaluate polynomial 200 1.64ms 1.6% 6.33MiB 36.4%
get Newton coeffs 200 232μs 0.2% 3.12KiB 0.0%
───────────────────────────────────────────────────────────────────────────
We see here that the Newton propagation requires more matrix-vector products (2000 compared to 1200 for Chebychev), partly because the Newton propagator is "chunked" to m_max
applications in each "restart" (10 by default, with 2 restarts required to reach machine precision in this case). Moreover, there is significant overhead beyond just matrix-vector multiplication, which will disappear only for significantly larger Hilbert spaces.
Disabling the collection of timing data
There there is a small overhead associated with collecting the timing data, it should not be enabled "in production". To QuantumPropagators.disable_timings
function undoes the previous QuantumPropagators.enable_timings
:
QuantumPropagators.disable_timings()
This again will trigger recompilation of any method that was collecting timing data, removing the associated overhead.