And did that automatically do the SIMD scheduling that was the only thing that could have made Itanium fast?
Yes. The Intel compiler for IA-64 had its default options set to produce the fastest possible binaries. For example:
...
/O2 optimize for maximum speed (DEFAULT)
...
/Qvec[-] enables(DEFAULT)/disables vectorization
...
OpenMP was the easiest way to enable parallelization (in our codes almost all for-loops have #pragma omp ... directives):
...
/Qopenmp enable the compiler to generate multi-threaded code based on the OpenMP* directives
...
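To illustrate the kind of loop meant here, a minimal sketch follows. It is not taken from our codes: the function name scaled_add and its parameters are hypothetical, and the loop is simply a typical candidate for a "#pragma omp parallel for" directive.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the original codes): y[i] += a * x[i].
// When OpenMP is enabled at compile time, the iterations are divided
// among threads. When it is not enabled, the pragma is ignored and the
// loop runs serially with the same result.
void scaled_add(double a, const std::vector<double>& x, std::vector<double>& y)
{
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] += a * x[i];   // each iteration touches only its own element
    }
}
```

Each iteration writes a different y[i], so no synchronization is needed inside the loop body.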
Auto-Parallelization was also available (!):
...
/Qparallel enable the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel
...
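As a rough illustration of what "can be safely executed in parallel" means, the sketch below contrasts two loops. Both functions are hypothetical examples, and whether a particular compiler version would actually parallelize the first one is an assumption, not something verified here.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: out[i] depends only on in[i], so the
// iterations could in principle be run in any order or concurrently.
std::vector<double> square_all(const std::vector<double>& in)
{
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * in[i];
    }
    return out;
}

// Loop-carried dependency: out[i] needs out[i - 1], so the iterations
// cannot simply be split among threads without changing the algorithm.
std::vector<double> prefix_sum(const std::vector<double>& in)
{
    std::vector<double> out(in.size());
    double running = 0.0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}
```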
SIMD-like features for explicit vectorization were also available from fvec.h and dvec.h:
...
const union
{
int i[4];
__m128d m;
} __f64vec2_abs_mask_cheat = {0xffffffff, 0x7fffffff, 0xffffffff, 0x7fffffff};
#define _f64vec2_abs_mask ((F64vec2)__f64vec2_abs_mask_cheat.m)
/* EMM Functionality Intrinsics */
class I8vec16; /* 16 elements, each element a signed or unsigned char data type */
class Is8vec16; /* 16 elements, each element a signed char data type */
class Iu8vec16; /* 16 elements, each element an unsigned char data type */
class I16vec8; /* 8 elements, each element a signed or unsigned short */
class Is16vec8; /* 8 elements, each element a signed short */
class Iu16vec8; /* 8 elements, each element an unsigned short */
class I32vec4; /* 4 elements, each element a signed or unsigned long */
class Is32vec4; /* 4 elements, each element a signed long */
class Iu32vec4; /* 4 elements, each element a unsigned long */
class I64vec2; /* 2 element, each a __m64 data type */
class I128vec1; /* 1 element, a __m128i data type */
...
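The abs mask in the excerpt above clears the sign bit of each double: on a little-endian machine the int pair {0xffffffff, 0x7fffffff} forms the 64-bit pattern 0x7fffffffffffffff. The following is a portable scalar sketch of the same idea, not the header's own code. The name abs_by_mask is hypothetical, and it assumes IEEE 754 64-bit doubles.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar version of the masking trick: AND the bit pattern
// of a double with 0x7fffffffffffffff to clear only the sign bit.
double abs_by_mask(double x)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "sketch assumes a 64-bit double");
    const std::uint64_t mask = 0x7fffffffffffffffULL;
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // read the raw bits without aliasing issues
    bits &= mask;                          // drop the sign bit, keep exponent and mantissa
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

The vector classes apply the same AND to both doubles in a 128-bit register at once, which is why the mask repeats the int pair twice.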