SSE4 is a set of instructions released in conjunction with Intel’s Penryn processor. Built upon the Intel 64 Instruction Set Architecture, SSE4 represented Intel’s first major change to its instruction set for some time. It followed smaller changes introduced with the Prescott processor (horizontal add/subtract, in the guise of SSE3) and the Core 2 Duo processor (absolute value and double-width align, in the guise of SSSE3).
Intel believes that SSE4 offers the greatest change to the x86 instruction set in five years and allows Penryn to run at higher clock frequencies than its Core 2 parents while staying within the same cool thermal envelope. Only applications that are able to use SSE4 (such as media encoding) benefit, but for those the speed improvements are reported to be of the order of 40%.
There are around 50 new instructions in the SSE4 set, the majority of which are SIMD instructions that apply one operation to several data elements at once, making it easier to take full advantage of parallelised code and data structures on Penryn’s multi-core processors.
SSE is an acronym for ‘Streaming SIMD Extensions’. The general concept behind these instructions is to combine certain common operations into one smooth operation: rather than a series of x instructions required for, say, discovering the dot product of two vectors, SSE provides one dedicated instruction. SSE reduces complex operations into native instructions, and this can greatly improve the efficiency of the processor in certain applications.
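To make the dot-product example concrete, the following is a minimal scalar sketch in C of what the SSE4.1 `DPPS` instruction computes on four packed single-precision values. It is a model for illustration, not the instruction itself: the names `dpps_model` and `dot4` are hypothetical, and the hardware may add the products in a different order, so rounding can differ slightly for inexact inputs. On a compiler that supports SSE4.1, the real instruction is exposed as the `_mm_dp_ps` intrinsic in `<smmintrin.h>`.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the SSE4.1 DPPS operation on four packed floats.
 * imm8 bits 7..4 choose which element products enter the sum;
 * imm8 bits 3..0 choose which output lanes receive the sum
 * (unselected lanes are set to 0). */
static void dpps_model(const float a[4], const float b[4],
                       unsigned imm8, float out[4])
{
    assert(a != NULL && b != NULL && out != NULL);

    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        if (imm8 & (1u << (4 + i)))
            sum += a[i] * b[i];
    }
    for (int i = 0; i < 4; ++i)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}

/* Hypothetical convenience wrapper: full four-element dot product.
 * Mask 0xF1 means "multiply all four lanes, write the sum to lane 0",
 * the same mask one would pass to _mm_dp_ps for this purpose. */
static float dot4(const float a[4], const float b[4])
{
    float out[4];
    dpps_model(a, b, 0xF1u, out);
    return out[0];
}
```

Without a dedicated instruction, the same result needs a packed multiply followed by several shuffles and adds, which is the "series of x instructions" the text refers to.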
SSE4 made 47 new instructions available with the Penryn processor. Most of the new instructions are related to vector operations, which are the staple of graphics and multimedia processing. Also included are primitives that increase the speed of streaming and improve access to device memory. Intel groups the instructions into two sets: ‘Vectorizing Compiler and Media Accelerators’ and ‘Efficient Accelerated String and Text Processing’. The table below is summarised from the Intel SSE4 Programming Reference, and provides a brief summary of the new instructions and their benefits:
Instruction | Description |
---|---|
BLENDPD, BLENDPS, BLENDVPD, BLENDVPS, PBLENDVB, PBLENDW | Blend Packed Double and Single Precision Floating-Point Values |
CRC32 | Accumulate CRC32 Value |
DPPD, DPPS | Dot Product of Packed Double and Single Precision Floating-Point Values |
EXTRACTPS, INSERTPS | Extract and Insert Packed Single Precision Floating-Point Value |
MOVNTDQA | Load Double Quadword Non-Temporal Aligned Hint |
MPSADBW | Compute Multiple Packed Sums of Absolute Difference |
PACKUSDW | Pack with Unsigned Saturation |
PCMPESTRI, PCMPISTRI | Packed Compare Explicit and Implicit Length Strings, Return Index |
PCMPESTRM, PCMPISTRM | Packed Compare Explicit and Implicit Length Strings, Return Mask |
PCMPEQQ, PCMPGTQ | Compare Packed Data For Equal or Greater Than |
PEXTRB, PEXTRD/PEXTRQ, PEXTRW | Extract Byte, Dword/Qword, and Word |
PHMINPOSUW | Packed Horizontal Word Minimum |
PINSRB, PINSRD/PINSRQ | Insert Byte and Dword/Qword |
PMAXSB, PMAXSD, PMAXUD, PMAXUW, PMINSB, PMINSD, PMINUD, PMINUW | Find Minimum and Maximum of Packed Signed, Unsigned, Dword and Word-length Integers |
PMOVSX, PMOVZX | Packed Move with Sign and Zero Extend |
PMULDQ, PMULLD | Multiply Packed Signed Dword Integers and Store Low Result |
POPCNT | Return the Count of Number of Bits Set to 1 |
PTEST | Logical Compare |
ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS | Round Packed and Scalar Double and Single Precision Floating-Point Values |
The 47 instructions available on Penryn represented the initial SSE4.1 release, with a further 7 instructions constituting Intel’s SSE4.2 release.
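As an illustration of one of the SSE4.2 additions, the `CRC32` entry in the table accumulates a checksum using the CRC-32C (Castagnoli) polynomial. The sketch below is a bit-by-bit scalar model in C of one byte-wide accumulate step, under the assumption that the usual software conventions (all-ones initial value, final inversion) are applied by the caller, since the instruction does neither itself. The function names `crc32c_step_u8` and `crc32c` are hypothetical; the hardware step corresponds to the `_mm_crc32_u8` intrinsic in `<nmmintrin.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one byte-wide CRC32 accumulate step.
 * 0x82F63B78 is the bit-reflected form of the CRC-32C (Castagnoli)
 * polynomial 0x1EDC6F41. */
static uint32_t crc32c_step_u8(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}

/* Hypothetical helper: CRC-32C of a whole buffer. The initial value
 * 0xFFFFFFFF and the final inversion are software conventions layered
 * on top of the accumulate step. */
static uint32_t crc32c(const void *data, size_t len)
{
    assert(data != NULL || len == 0);

    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i)
        crc = crc32c_step_u8(crc, p[i]);
    return crc ^ 0xFFFFFFFFu;
}
```

The inner loop runs eight shift-and-XOR iterations per byte; the dedicated instruction collapses that loop into a single operation and also has wider forms that consume several bytes at once.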
Previous versions of SSE have been licensed to AMD for use on its chips. It was unclear at the time of release whether SSE4 would be licensed in the same way.
In order to utilise the additional instructions fully, code needs to be developed with them in mind from the very start. In particular, compilers need to be modified to take advantage of the new native instructions, and to this end Intel released a new version of its C compiler to coincide with Penryn’s debut. As is usually the case with processor enhancements, whether in hardware, firmware or software, the benefits are unlikely to manifest themselves immediately, and it is the job of software engineers to realise SSE4’s potential.
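One practical consequence for developers is that a program should check whether the processor supports SSE4 before taking an SSE4 code path. A minimal sketch in C follows, assuming a GCC- or Clang-style compiler on x86 for the `__get_cpuid` helper from `<cpuid.h>`; on other platforms the query simply reports that nothing is available. The names `sse4_support`, `decode_leaf1_ecx`, `read_leaf1_ecx` and the `HAVE_CPUID_H` macro are hypothetical. The bit positions are those reported in ECX by CPUID leaf 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID_H 1   /* hypothetical macro, local to this sketch */
#endif

/* Feature bits reported in ECX by CPUID leaf 1. */
#define CPUID_ECX_SSE41  (1u << 19)
#define CPUID_ECX_SSE42  (1u << 20)
#define CPUID_ECX_POPCNT (1u << 23)

struct sse4_support {
    bool sse41;
    bool sse42;
    bool popcnt;
};

/* Pure decoding step: turn a raw ECX value into named flags. */
static struct sse4_support decode_leaf1_ecx(uint32_t ecx)
{
    struct sse4_support s;
    s.sse41  = (ecx & CPUID_ECX_SSE41)  != 0;
    s.sse42  = (ecx & CPUID_ECX_SSE42)  != 0;
    s.popcnt = (ecx & CPUID_ECX_POPCNT) != 0;
    return s;
}

/* Reads ECX from CPUID leaf 1 where the compiler helper is available.
 * Returns false and stores 0 when the query cannot be made. */
static bool read_leaf1_ecx(uint32_t *ecx_out)
{
    assert(ecx_out != NULL);
#ifdef HAVE_CPUID_H
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        *ecx_out = ecx;
        return true;
    }
#endif
    *ecx_out = 0;
    return false;
}
```

A caller would typically run `read_leaf1_ecx` once at start-up, pass the value to `decode_leaf1_ecx`, and select either an SSE4 routine or a portable fallback based on the resulting flags.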