Intel absolutely has faster PDEP, PEXT, DIV, and IDIV instructions. I know, because I've measured them personally. In the case of PDEP / PEXT, 1000% faster or more (single-clock PDEP / PEXT on Intel, but over 15 clocks on AMD).
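For anyone who wants to reproduce that kind of measurement, here is a minimal sketch of a dependent-chain PDEP timing loop (the constants and structure are mine, purely illustrative; it assumes a BMI2-capable CPU, and RDTSC counts reference cycles, so treat the per-op number as approximate):

```c
// Latency sketch: a dependent chain of PDEP, timed with RDTSC.
// Compile with e.g. gcc -O2 -mbmi2. The loop carries a data dependency
// through `v`, so it measures latency, not throughput. The value quickly
// converges to 0, which doesn't matter here: the cost is governed by the
// (constant) mask, not the data.
#include <immintrin.h>
#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t mask = 0x5555555555555555ULL; // scatter into even bit slots
    uint64_t v = 0x123456789abcdef0ULL;
    const int iters = 100 * 1000 * 1000;

    uint64_t start = __rdtsc();
    for (int i = 0; i < iters; i++)
        v = _pdep_u64(v, mask);      // result feeds the next iteration
    uint64_t stop = __rdtsc();

    printf("~%.2f ref-cycles per PDEP (v=%llx)\n",
           (double)(stop - start) / iters, (unsigned long long)v);
    return 0;
}
```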
AMD and Intel both have single-clock add / subtract. AMD had the first dual-AES pipeline (supporting two concurrent AES instructions), so AMD's AES IPC was superior to Skylake's.
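To illustrate what two AES pipes buy you, a hedged sketch of independent AESENC chains (hypothetical helper of mine, not real crypto code; assumes AES-NI):

```c
// Compile with e.g. gcc -O2 -maes. AESENC latency is ~4 cycles on these
// cores, so ~4 independent chains saturate one AES pipe and ~8 saturate
// two: on a dual-pipe core this loop runs roughly twice as fast.
#include <wmmintrin.h>

void aes_chains(__m128i s[8], __m128i key, int rounds) {
    for (int i = 0; i < rounds; i++)
        for (int c = 0; c < 8; c++)           // 8 independent chains
            s[c] = _mm_aesenc_si128(s[c], key);
}
```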
What instructions, what conditions, what memory? It's actually very complicated. If we're sitting around adding/subtracting numbers inside registers all day, both processors tie. MAC instructions I can't recall off the top of my head... but you get the picture. It's pretty complicated, because all these CPUs have different speeds for every instruction.
DIV/IDIV
Context!
32-bit operands? Sure: rock-solid reciprocal throughput of 6 cycles for consecutive instructions, i.e. 1/6 DIV per cycle (the *Lake uarchs).
What about 64-bit operands? Well, the sides reverse, don't they? A (presumably) microcoded calculation with variable latency & variable throughput, roughly twice as slow as Zen 2. What is even more intriguing, it is slower than the Goldmont Plus uarch, which doesn't sport the fancy fixed 6-cycle reciprocal throughput for operands shorter than 64 bits!
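A sketch of the operand-width cliff: the same C expression compiles to a 32-bit or a 64-bit DIV depending on the types (function names are mine; caveat: some compilers insert a "values fit in 32 bits" fast path for 64-bit division, which can hide the microcoded cost on real data):

```c
#include <stdint.h>
#include <stddef.h>

uint32_t sum_div32(const uint32_t *n, const uint32_t *d, size_t len) {
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc += n[i] / d[i];   // 32-bit DIV: fixed 6-cycle throughput on *Lake
    return acc;
}

uint64_t sum_div64(const uint64_t *n, const uint64_t *d, size_t len) {
    uint64_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc += n[i] / d[i];   // 64-bit DIV: microcoded, variable latency on *Lake
    return acc;
}
```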
PDEP/PEXT
A GPR bit scatter/gather niche? Sure, Intel created those instructions (BMI2); if they were "microcoded" on Intel's own flagship uarch, I don't see a reason for making them at all.
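For those not in the know, this is what the scatter/gather niche looks like in code, assuming a BMI2-capable target; the nibble-packing use case is a hypothetical example of mine:

```c
// Compile with e.g. gcc -O2 -mbmi2.
#include <immintrin.h>
#include <stdint.h>

#define NIBBLE_MASK 0x0f0f0f0f0f0f0f0fULL

// PEXT "gathers": pull the masked bits together at the low end.
static inline uint64_t gather_low_nibbles(uint64_t x) {
    return _pext_u64(x, NIBBLE_MASK);   // 8 nibbles -> 32 contiguous bits
}

// PDEP "scatters": spread low bits back out under the mask.
static inline uint64_t scatter_low_nibbles(uint64_t packed) {
    return _pdep_u64(packed, NIBBLE_MASK);
}
```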
The differences between the ALU capabilities of Intel and AMD architectures (not counting front-end, cache & memory) are much bigger than many "IPC" discussions admit.
What about POPCNT/LZCNT/TZCNT? (number of bits set to 1, number of leading zeroes, number of trailing zeroes: a simple explanation for those who are "not in the know")
Latency on the *Lake uarchs is 3 cycles; on AMD it is 1 (or 2 for TZCNT). And that is not all. On Intel only a single port is capable of handling these instructions (throughput is 1 instruction per cycle). On AMD? All four ALUs are able to execute POPCNT and LZCNT (4 "IPC"), and two of them TZCNT (2 "IPC").
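The three instructions as intrinsics, with independent accumulators so a 4-ALU design can actually use its width (a sketch; `popcnt_sum` is my name, and on a single-port uarch the loop caps at ~1 POPCNT per cycle regardless):

```c
// Assumes POPCNT/LZCNT/BMI1; compile with e.g. gcc -O2 -mpopcnt -mlzcnt -mbmi.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

uint64_t popcnt_sum(const uint64_t *a, size_t len) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;  // 4 independent chains
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        s0 += (uint64_t)_mm_popcnt_u64(a[i]);
        s1 += (uint64_t)_mm_popcnt_u64(a[i + 1]);
        s2 += (uint64_t)_mm_popcnt_u64(a[i + 2]);
        s3 += (uint64_t)_mm_popcnt_u64(a[i + 3]);
    }
    for (; i < len; i++)                      // leftover elements
        s0 += (uint64_t)_mm_popcnt_u64(a[i]);
    return s0 + s1 + s2 + s3;
}

uint64_t leading_zeroes(uint64_t x)  { return _lzcnt_u64(x); }  // LZCNT
uint64_t trailing_zeroes(uint64_t x) { return _tzcnt_u64(x); }  // TZCNT
```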
There is one more implication of the differences between Intel's "super-ALUs" and AMD's clustered ALUs with separate GPR & SIMD "engines". By executing these three instructions you block port 1, which is really an abstraction over a "super-ALU" (take a look at https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake to see what else it is supposed to handle).
AMD's design philosophy of a separate "vector engine" means that they do not block each other (see https://en.wikichip.org/wiki/amd/microarchitectures/zen_2 ).
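An illustration of the port-sharing point, with the caveat that it is only a sketch: on a Coffee Lake-class core both POPCNT and 64-bit IMUL issue on port 1 alone, so the two operations below compete for one port, while a Zen 2 core spreads them over separate ALUs (real bottlenecks also depend on dependency chains, e.g. the serial multiply here):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

uint64_t mixed_kernel(const uint64_t *a, size_t len) {
    uint64_t bits = 0, prod = 1;
    for (size_t i = 0; i < len; i++) {
        bits += (uint64_t)_mm_popcnt_u64(a[i]); // port 1 only on *Lake
        prod *= a[i] | 1;                       // IMUL: also port 1 on *Lake
    }
    return bits ^ prod;
}
```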
Which, I suppose, is the reason why AMD sees higher gains when SMT is enabled: there is no way to saturate such a wide engine with a single thread / one non-artificial instruction stream.
[/END of OT]
I am looking forward to Zen 3 mostly because of the unified L3 cache. At least in Azure, Zen 2 CCXs are exposed as NUMA nodes (starting with 16 "virtual core"/thread VMs), to let those not intimate with the details know that spreading threads across all cores will randomly end up with cache misses, even when, from the code's perspective, the data "should" be present in the cache.
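What you can do about it today, sketched for Linux: pin each thread to the CPUs of one NUMA node (i.e., one CCX as Azure exposes it) so its working set stays in that node's L3. `pin_to_ccx` and the CPU numbering are hypothetical; read the real topology from /sys/devices/system/node/ or libnuma instead of hard-coding it:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Restrict thread `t` to cpu_count consecutive CPUs starting at first_cpu.
static int pin_to_ccx(pthread_t t, int first_cpu, int cpu_count) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first_cpu; c < first_cpu + cpu_count; c++)
        CPU_SET(c, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}

// e.g. pin_to_ccx(pthread_self(), 0, 8); // assuming vCPUs 0-7 share one CCX/L3
```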
Cache-sensitive workloads actually see higher perf degradation (vs. the theoretical 2×) on Zen 2 than on Cascade Lake when you double the VM size (from 4c/8t to 8c/16t).