Profiling Howto

To choose an appropriate propagation method and parameters for a given problem, it is essential to benchmark and profile the propagation.

Consider the following simple example:

using QuantumControlTestUtils.RandomObjects: random_dynamic_generator, random_state_vector

tlist = collect(range(0, step=1.0, length=101));
N = 200;  # size of Hilbert space
H = random_dynamic_generator(N, tlist);
Ψ₀ = random_state_vector(N);

BenchmarkTools

The first line of defense is the use of BenchmarkTools. The @benchmark macro allows to generate statistics on how long a call to propagate takes.

Chebychev propagation

For example, we can time the propagation with the Chebychev method:

using BenchmarkTools
using QuantumPropagators
using QuantumPropagators: Cheby

@benchmark propagate($Ψ₀, $H, $tlist; method=Cheby, check=false) samples=10
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (minmax):  25.758 ms 27.457 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     25.925 ms                GC (median):    0.00%
 Time  (mean ± σ):   26.256 ms ± 660.569 μs   GC (mean ± σ):  0.00% ± 0.00%

  █▁ ▁  ▁▁                         ▁                    ▁   ▁  
  ██▁█▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
  25.8 ms         Histogram: frequency by time         27.5 ms <

 Memory estimate: 1.01 MiB, allocs estimate: 3035.

Newton propagation

Or, the same propagation with the Newton method:

using QuantumPropagators: Newton

@benchmark propagate($Ψ₀, $H, $tlist; method=Newton, check=false) samples=10
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (minmax):  93.642 ms96.840 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     94.805 ms               GC (median):    0.00%
 Time  (mean ± σ):   94.941 ms ±  1.087 ms   GC (mean ± σ):  0.38% ± 0.60%

  ▁  ▁▁        ▁     ▁   ▁                 █             ▁  
  █▁▁██▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  93.6 ms         Histogram: frequency by time        96.8 ms <

 Memory estimate: 17.43 MiB, allocs estimate: 54268.

The result in this case illustrates the significant advantage of the Chebychev method for systems of moderate to small size and unitary dynamics.

When using custom data structures for the dynamical generators or states, @benchmark should also be used to optimize lower-level operations as much as possible, e.g. the application of the Hamiltonian to the state.

TimerOutputs

A lot more insight into the internals of a propagate call can be obtained by collecting timing data. This functionality is integrated in QuantumPropagators and uses the TimerOutputs package internally.

Enabling the collection of timing data

To enable collecting internal timing data, call QuantumPropagators.enable_timings:

QuantumPropagators.enable_timings()

The status of the data collection can be verified with QuantumPropagators.timings_enabled.

Chebychev propagation

Since the call to QuantumPropagators.enable_timings invalidates existing compiled code, and to avoid the compilation overhead showing up in the timing data, we call propagate once to ensure compilation:

propagate(Ψ₀, H, tlist; method=Cheby);

In any subsequent propagation, we could access the timing data in a callback to propagate:

function show_timing_data(propagator, args...)
    if propagator.t == tlist[end]
        show(propagator.timing_data, compact=true)
    end
end

propagate(Ψ₀, H, tlist; method=:cheby, callback=show_timing_data);
 ───────────────────────────────────────────────────────────────────
                                         Time          Allocations
                                   ───────────────   ───────────────
          Total measured:               266ms            19.7MiB

 Section                   ncalls     time    %tot     alloc    %tot
 ───────────────────────────────────────────────────────────────────
 prop_step!                   100   24.8ms  100.0%    134KiB  100.0%
   matrix-vector product    1.20k   23.3ms   93.8%     0.00B    0.0%
 ───────────────────────────────────────────────────────────────────

See the TimerOutputs documentation for details on how to print the timing_data.

Alternatively, without a callback:

propagator = init_prop(Ψ₀, H, tlist; method=Cheby)
for step ∈ 1:(length(tlist)-1)
    prop_step!(propagator)
end
show(propagator.timing_data, compact=true)
 ───────────────────────────────────────────────────────────────────
                                         Time          Allocations
                                   ───────────────   ───────────────
          Total measured:              48.2ms             415KiB

 Section                   ncalls     time    %tot     alloc    %tot
 ───────────────────────────────────────────────────────────────────
 prop_step!                   100   25.9ms  100.0%    134KiB  100.0%
   matrix-vector product    1.20k   24.3ms   93.8%     0.00B    0.0%
 ───────────────────────────────────────────────────────────────────

The reported runtimes here are less important than the number of function calls and the runtime percentages. In this case, the timing data shows that the propagation is dominated by the matrix-vector products (applying the Hamiltonian to the state), as it should. The percentage would go to 100% for larger Hilbert spaces.

Newton propagation

For the Newton method:

propagate(Ψ₀, H, tlist; method=Newton);   # recompilation
propagate(Ψ₀, H, tlist; method=Newton, callback=show_timing_data);
 ───────────────────────────────────────────────────────────────────────────
                                                 Time          Allocations
                                           ───────────────   ───────────────
              Total measured:                   135ms            19.0MiB

 Section                           ncalls     time    %tot     alloc    %tot
 ───────────────────────────────────────────────────────────────────────────
 prop_step!                           100   96.6ms  100.0%   17.4MiB  100.0%
   arnoldi!                           200   43.6ms   45.1%   16.4KiB    0.1%
     matrix-vector product          2.00k   40.4ms   41.8%     0.00B    0.0%
   diagonalize_hessenberg_matrix      200   25.9ms   26.8%   9.76MiB   56.2%
   get Leja points                    200   23.3ms   24.1%   6.25KiB    0.0%
   evaluate polynomial                200   1.52ms    1.6%   6.33MiB   36.4%
   get Newton coeffs                  200    228μs    0.2%   3.12KiB    0.0%
 ───────────────────────────────────────────────────────────────────────────

We see here that the Newton propagation requires more matrix-vector products (2000 compared to 1200 for Chebychev), partly because the Newton propagator is "chunked" to m_max applications in each "restart" (10 by default, with 2 restarts required to reach machine precision in this case). Moreover, there is significant overhead beyond just matrix-vector multiplication, which will disappear only for significantly larger Hilbert spaces.

Disabling the collection of timing data

There there is a small overhead associated with collecting the timing data, it should not be enabled "in production". To QuantumPropagators.disable_timings function undoes the previous QuantumPropagators.enable_timings:

QuantumPropagators.disable_timings()

This again will trigger recompilation of any method that was collecting timing data, removing the associated overhead.