What we decided to do was optimize for the common scenario: uniform blocks of 128-bit (or all-scalar) code separated from uniform blocks of 256-bit code. We maintain an internal record of when we transition between a state where the upper bits may contain something nonzero and a state where the upper bits are guaranteed to be zero. We give you a fast (1-cycle throughput) way to reach the second state: VZEROUPPER (though VZEROALL, XRSTOR, and a reboot also work). Once you're in that state of zeroed-upperness, you can execute 128-bit (or scalar) code, VEX-prefixed or not, and pay no transition penalty. You can also transition back to executing 256-bit instructions with no penalty, and you can move freely between VEX-encoded instructions of any width without penalty.

The downside is that if you try to move from 256-bit instructions to legacy 128-bit instructions without that VZEROUPPER, you're going to pay. The way we chose to make you pay is optimized for the common case of long blocks of SSE code: you pay once during the transition to legacy 128-bit code instead of on every instruction. We do it by copying the upper 128 bits of all 16 registers to a special scratchpad, and this copying takes time (something like 50 cycles - still TBD). After that, the legacy SSE code can operate as long as it wants with no penalty. When you transition back to a VEX-prefixed instruction, you pay the penalty again to restore state from that scratchpad.
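For instance, a mixed sequence like the following (a hypothetical illustration, not from the original text) would hit that save/restore cost twice - once entering the legacy SSE run, and once leaving it:

```asm
vaddps  ymm0, ymm1, ymm2   ; 256-bit VEX op: upper halves now "dirty"
addps   xmm3, xmm4         ; legacy SSE: triggers the save penalty
                           ; (upper 128 bits of all 16 regs copied out)
mulps   xmm3, xmm5         ; more legacy SSE: no additional penalty
vaddps  ymm0, ymm0, ymm1   ; back to VEX: triggers the restore penalty
```

Interleave VEX and legacy instructions one-for-one and you would pay the save/restore pair on every alternation, which is why the cost is front-loaded onto the block transition instead.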
The solution for this problem is for software to execute VZEROUPPER before leaving its basic block of 256-bit code. Use it prior to calling any (ABI-compliant) function, and prior to any return into code you don't control, since you can't know whether the code on the other side uses legacy SSE.
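As a sketch of that discipline (illustrative only; `foo` is a hypothetical callee, not from the original text), compare the two call sites below - only the second avoids the transition penalty if `foo` happens to contain legacy SSE code:

```asm
; penalized: upper halves still dirty when foo's SSE code runs
vaddps  ymm0, ymm1, ymm2
call    foo

; clean: zero the upper halves before leaving the 256-bit block
vaddps  ymm0, ymm1, ymm2
vzeroupper                 ; cheap; puts us in the zeroed-upper state
call    foo
```

The same pattern applies before `ret`: end every 256-bit block with VZEROUPPER unless you know the next code to run is also VEX-encoded.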