Yes, I have reports of >20M keys/s on an RTX 4080 (e.g. https://twitter.com/chrisswenor/status/1615828271022718976), as well as my own initial benchmarks on cloud GPUs:
— Benchmark Result
Devices: Tesla T4
Total: 3696351582 keys, matching: 100, time: 509.64s, avg: 7252901 keys/s
— Benchmark Result
Devices: NVIDIA L4
Total: 99000000 keys, matching: 100, time: 10.55s, avg: 9382530 keys/s
Note that those figures exclude batch size optimizations.
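
If anyone wants to sanity-check or compare the averages, here is a minimal Python sketch that recomputes keys/s from the totals and times in the output above. It is not part of the tool itself; the `runs` dict and `keys_per_second` helper are names I made up for illustration. Small differences from the reported averages are expected because the printed times are rounded to two decimals.

```python
# Totals and wall-clock times copied from the benchmark output above.
# The structure and names here are illustrative assumptions, not the tool's API.
runs = {
    "Tesla T4": (3_696_351_582, 509.64),
    "NVIDIA L4": (99_000_000, 10.55),
}


def keys_per_second(total_keys: int, seconds: float) -> float:
    """Average throughput: keys generated divided by elapsed seconds."""
    return total_keys / seconds


rates = {name: keys_per_second(keys, secs) for name, (keys, secs) in runs.items()}

for name, rate in rates.items():
    print(f"{name}: {rate / 1e6:.2f}M keys/s")

# Relative throughput of the L4 compared to the T4 in these two runs.
speedup = rates["NVIDIA L4"] / rates["Tesla T4"]
print(f"L4 vs T4: {speedup:.2f}x")
```

Keep in mind the two runs differ a lot in length (about 510 s vs about 11 s), so the ratio is only a rough indication.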