GCC aliasing checks w/ restrict pointers

Consider the following two snippets:

    #define ALIGN_BYTES 32
    #define ASSUME_ALIGNED(x) x = __builtin_assume_aligned(x, ALIGN_BYTES)

    void fn0(const float *restrict a0, const float *restrict a1,
             float *restrict b, int n)
    {
        ASSUME_ALIGNED(a0); ASSUME_ALIGNED(a1); ASSUME_ALIGNED(b);

        for (int i = 0; i < n; ++i)
            b[i] = a0[i] + a1[i];
    }

    void fn1(const float *restrict *restrict a, float *restrict b, int n)
    {
        ASSUME_ALIGNED(a[0]); ASSUME_ALIGNED(a[1]); ASSUME_ALIGNED(b);

        for (int i = 0; i < n; ++i)
            b[i] = a[0][i] + a[1][i];
    }

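When I compile the functions with gcc-4.7.2 -Ofast -march=native -std=c99 -ftree-vectorizer-verbose=5 -S test.c -Wall, I find that GCC inserts aliasing checks for the second function.

How can I prevent this, so that the resulting assembly for fn1 is the same as for fn0? (When the number of parameters increases from three to, say, thirty, the argument-passing approach (fn0) becomes cumbersome, and the number of aliasing checks in the fn1 approach becomes ridiculous.)

Assembly (x86-64, AVX-capable chip); the aliasing checks are at .LFB10: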

    fn0:
    .LFB9:
        .cfi_startproc
        testl   %ecx, %ecx
        jle     .L1
        movl    %ecx, %r10d
        shrl    $3, %r10d
        leal    0(,%r10,8), %r9d
        testl   %r9d, %r9d
        je      .L8
        cmpl    $7, %ecx
        jbe     .L8
        xorl    %eax, %eax
        xorl    %r8d, %r8d
        .p2align 4,,10
        .p2align 3
    .L4:
        vmovaps (%rsi,%rax), %ymm0
        addl    $1, %r8d
        vaddps  (%rdi,%rax), %ymm0, %ymm0
        vmovaps %ymm0, (%rdx,%rax)
        addq    $32, %rax
        cmpl    %r8d, %r10d
        ja      .L4
        cmpl    %r9d, %ecx
        je      .L1
    .L3:
        movslq  %r9d, %rax
        salq    $2, %rax
        addq    %rax, %rdi
        addq    %rax, %rsi
        addq    %rax, %rdx
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
    .L6:
        vmovss  (%rsi,%rax,4), %xmm0
        vaddss  (%rdi,%rax,4), %xmm0, %xmm0
        vmovss  %xmm0, (%rdx,%rax,4)
        addq    $1, %rax
        leal    (%r9,%rax), %r8d
        cmpl    %r8d, %ecx
        jg      .L6
    .L1:
        vzeroupper
        ret
    .L8:
        xorl    %r9d, %r9d
        jmp     .L3
        .cfi_endproc
    .LFE9:
        .size   fn0, .-fn0
        .p2align 4,,15
        .globl  fn1
        .type   fn1, @function
    fn1:
    .LFB10:
        .cfi_startproc
        testq   %rdx, %rdx
        movq    (%rdi), %r8
        movq    8(%rdi), %r9
        je      .L12
        leaq    32(%rsi), %rdi
        movq    %rdx, %r10
        leaq    32(%r8), %r11
        shrq    $3, %r10
        cmpq    %rdi, %r8
        leaq    0(,%r10,8), %rax
        setae   %cl
        cmpq    %r11, %rsi
        setae   %r11b
        orl     %r11d, %ecx
        cmpq    %rdi, %r9
        leaq    32(%r9), %r11
        setae   %dil
        cmpq    %r11, %rsi
        setae   %r11b
        orl     %r11d, %edi
        andl    %edi, %ecx
        cmpq    $7, %rdx
        seta    %dil
        testb   %dil, %cl
        je      .L19
        testq   %rax, %rax
        je      .L19
        xorl    %ecx, %ecx
        xorl    %edi, %edi
        .p2align 4,,10
        .p2align 3
    .L15:
        vmovaps (%r9,%rcx), %ymm0
        addq    $1, %rdi
        vaddps  (%r8,%rcx), %ymm0, %ymm0
        vmovaps %ymm0, (%rsi,%rcx)
        addq    $32, %rcx
        cmpq    %rdi, %r10
        ja      .L15
        cmpq    %rax, %rdx
        je      .L12
        .p2align 4,,10
        .p2align 3
    .L20:
        vmovss  (%r9,%rax,4), %xmm0
        vaddss  (%r8,%rax,4), %xmm0, %xmm0
        vmovss  %xmm0, (%rsi,%rax,4)
        addq    $1, %rax
        cmpq    %rax, %rdx
        ja      .L20
    .L12:
        vzeroupper
        ret
    .L19:
        xorl    %eax, %eax
        jmp     .L20
        .cfi_endproc

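There is a way to tell the compiler to stop checking for aliasing.

Add the line:

    #pragma GCC ivdep

right in front of the loop you want to vectorize. If you need more information, please read: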

https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/Loop-Specific-Pragmas.html
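To make the placement concrete, here is a minimal sketch of the question's fn1 with the pragma applied. The name fn1_ivdep and the local variables a0, a1 and out are made up for this illustration, and the pragma is only recognized by GCC 4.9 or newer, so it would not help with the 4.7.2 compiler from the question. I have not checked the assembly this produces.

```c
#define ALIGN_BYTES 32

/* Same computation as fn1 in the question; the name fn1_ivdep is
   hypothetical and used only for this sketch. */
void fn1_ivdep(const float *restrict *restrict a, float *restrict b, int n)
{
    /* Load the row pointers once and attach the alignment promise. */
    const float *a0 = __builtin_assume_aligned(a[0], ALIGN_BYTES);
    const float *a1 = __builtin_assume_aligned(a[1], ALIGN_BYTES);
    float *out = __builtin_assume_aligned(b, ALIGN_BYTES);

    /* Promise GCC that the loop has no loop-carried dependencies that
       would prevent vectorization (GCC 4.9 or newer). */
#pragma GCC ivdep
    for (int i = 0; i < n; ++i)
        out[i] = a0[i] + a1[i];
}
```

The pragma is a promise from the programmer, not a check: if the arrays really do overlap, the vectorized loop may produce wrong results.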

Does this help?

    void fn1(const float **restrict a, float *restrict b, int n)
    {
        const float * restrict a0 = a[0];
        const float * restrict a1 = a[1];

        ASSUME_ALIGNED(a0); ASSUME_ALIGNED(a1); ASSUME_ALIGNED(b);

        for (int i = 0; i < n; ++i)
            b[i] = a0[i] + a1[i];
    }

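Edit: second attempt :). Information taken from http://locklessinc.com/articles/vectorize/:

    gcc -ffast-math ...

So, what about the flag

    -fno-strict-aliasing

?

If I understand you correctly, you just want to know how to turn this check off? If so, this gcc command-line option should help you.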
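For context, a hedged note on what this flag governs. As far as I know, -fno-strict-aliasing switches off GCC's type-based alias analysis, which is a separate mechanism from restrict. The sketch below (store_both is a made-up name, not part of the question's code) shows the kind of code that analysis applies to: pointers to different types, which the compiler may assume never refer to the same object.

```c
/* Hypothetical example; store_both is not part of the question's code. */
int store_both(int *pi, float *pf)
{
    *pi = 1;
    *pf = 2.0f;   /* under strict aliasing, assumed not to change *pi */
    return *pi;   /* so the compiler may simply return the constant 1 */
}
```

In fn1 every pointer has float as its element type, so type-based analysis cannot separate them either way. Disabling it asks the compiler to assume more aliasing, not less, so I would not expect the flag to remove the runtime checks.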

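Edit:

In addition to your comment: isn't it forbidden to use const types with restrict pointers?

This is from ISO/IEC 9899 (6.7.3.1 Formal definition of restrict):

1.

Let D be a declaration of an ordinary identifier that provides a means of designating an object P as a restrict-qualified pointer to type T.

4.

During each execution of B, let L be any lvalue that has &L based on P. If L is used to access the value of the object X that it designates, and X is also modified (by any means), then the following requirements apply: T shall not be const-qualified. Every other lvalue used to access the value of X shall also have its address based on P. Every access that modifies X shall be considered also to modify P, for the purposes of this subclause. If P is assigned the value of a pointer expression E that is based on another restricted pointer object P2, associated with block B2, then either the execution of B2 shall begin before the execution of B, or the execution of B2 shall end prior to the assignment. If these requirements are not met, then the behavior is undefined.

And a much more interesting point, the same as with register, is this one:

6.

A translator is free to ignore any or all aliasing implications of uses of restrict.

So if you can't find a command-line parameter that forces gcc to do this, it is probably not possible, because according to the standard it does not have to offer such an option.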
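As I read paragraph 4 quoted above, the const restriction only applies when an object is both accessed through the restricted pointer and modified during the block. A minimal sketch, assuming a helper named add_arrays that is invented for this illustration:

```c
#include <stddef.h>

/* Hypothetical helper, named add_arrays for this sketch only. */
void add_arrays(const float *restrict a0, const float *restrict a1,
                float *restrict b, size_t n)
{
    /* Objects reached through a0 and a1 are only read here, so the
       "T shall not be const-qualified" requirement is not triggered.
       Objects reached through b are modified, so every access to them
       must go through a pointer based on b. */
    for (size_t i = 0; i < n; ++i)
        b[i] = a0[i] + a1[i];
}
```

Under that reading, add_arrays(x, x, out, n) is fine because x is never modified, while add_arrays(x, y, x, n) has undefined behavior because x is modified through b and also read through a0.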

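I apologize in advance, because I cannot reproduce the GCC 4.7 results on my machine, but there are two possible solutions.

Use typedef to compose * restrict * restrict properly. According to a former colleague who develops the LLVM compiler, this is the one exception to typedef behaving like the preprocessor in C, and it exists to allow the anti-aliasing behavior you want.

I attempted this below, but I am not sure I succeeded. Please check my attempt carefully.
Use the syntax described in the answer about restrict qualifiers with C99 variable-length arrays (VLAs).

I attempted this below, but I am not sure I succeeded. Please check my attempt carefully.

Here is the code I used to run my experiments, but I was unable to determine whether either of my suggestions works as intended.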

    #define ALIGN_BYTES 32
    #define ASSUME_ALIGNED(x) x = __builtin_assume_aligned(x, ALIGN_BYTES)

    void fn0(const float *restrict a0, const float *restrict a1,
             float *restrict b, int n)
    {
        ASSUME_ALIGNED(a0); ASSUME_ALIGNED(a1); ASSUME_ALIGNED(b);

        for (int i = 0; i < n; ++i)
            b[i] = a0[i] + a1[i];
    }

    #if defined(ARRAY_RESTRICT)
    void fn1(const float *restrict a[restrict], float * restrict b, int n)
    #elif defined(TYPEDEF_SOLUTION)
    typedef float * restrict frp;
    void fn1(const frp *restrict a, float *restrict b, int n)
    #else
    void fn1(const float *restrict *restrict a, float *restrict b, int n)
    #endif
    {
        //ASSUME_ALIGNED(a[0]); ASSUME_ALIGNED(a[1]); ASSUME_ALIGNED(b);

        for (int i = 0; i < n; ++i)
            b[i] = a[0][i] + a[1][i];
    }

Again, I apologize for the half-baked nature of this answer. Please don't down-vote me for trying but not succeeding.
