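Does multithreading emphasize memory fragmentation?
Description
When allocating and deallocating randomly sized memory chunks with 4 or more threads using OpenMP's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test program's runtime. Its memory consumption grows from 1050 MB to 1500 MB or more without the extra memory actually being used.
As valgrind shows no issues, I must assume that what looks like a memory leak is actually an emphasized effect of memory fragmentation.
Interestingly, the effect does not show if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of the allocated chunks is reduced to 256 KB (from 1 MB), the effect gets weaker.
Can heavy concurrency emphasize fragmentation that much? Or is this more likely to be a bug in the heap?
Test Program Description
The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, performing 5000 allocations. If the memory limit is hit, the chunks allocated first are deallocated until the memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All of this work is done by each thread spawned by OpenMP.
This memory allocation scheme lets us expect a memory consumption of about 260 MB per thread (including some bookkeeping data).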
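Demo Program
As this is really something you might want to test yourself, you can download the sample program together with a simple makefile from Dropbox.
When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.
For completeness, the actual code follows: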
    #include <stdlib.h>
    #include <stdio.h>
    #include <stdint.h>   // uint64_t (replaces a hand-written typedef)
    #include <iostream>
    #include <vector>
    #include <deque>
    #include <omp.h>
    #include <math.h>

    void runParallelAllocTest()
    {
        // constants
        const int NUM_ALLOCATIONS = 5000;   // alloc's per thread
        const int NUM_THREADS = 4;          // how many threads?
        const int NUM_ITERS = NUM_THREADS;  // how many overall repetions
        const bool USE_NEW = true;          // use new or malloc? seems to make no difference (as it should)
        const bool DEBUG_ALLOCS = false;    // debug output

        // pre store allocation sizes
        const int NUM_PRE_ALLOCS = 20000;
        const uint64_t MEM_LIMIT = (1024 * 1024) * 256;  // x MB per process
        const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

        srand(1);
        std::vector<size_t> allocations;
        allocations.resize(NUM_PRE_ALLOCS);
        for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
            allocations[i] = rand() % MAX_CHUNK_SIZE;  // use up to x MB chunks
        }

        #pragma omp parallel num_threads(NUM_THREADS)
        #pragma omp for
        for (int i = 0; i < NUM_ITERS; ++i) {
            uint64_t totalAllocBytes = 0;  // 64 bit, so the running total cannot overflow
            uint64_t currAllocBytes = 0;

            std::deque< std::pair<char*, uint64_t> > pointers;
            const int myId = omp_get_thread_num();

            for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
                // new allocation
                const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS];

                char* pnt = NULL;
                if (USE_NEW) {
                    pnt = new char[allocSize];
                } else {
                    pnt = (char*) malloc(allocSize);
                }
                pointers.push_back(std::make_pair(pnt, allocSize));

                totalAllocBytes += allocSize;
                currAllocBytes += allocSize;

                // fill with values to add "delay"
                for (int fill = 0; fill < (int) allocSize; ++fill) {
                    pnt[fill] = (char)(j % 255);
                }

                if (DEBUG_ALLOCS) {
                    std::cout << "Id " << myId << " New alloc " << pointers.size()
                              << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
                }

                // free all or just a bit
                if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                    int frees = 0;

                    // keep this much allocated
                    // last check, free all
                    uint64_t memLimit = MEM_LIMIT;
                    if (j == NUM_ALLOCATIONS - 1) {
                        std::cout << "Id " << myId << " about to release all memory: "
                                  << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                        memLimit = 0;
                    }
                    //MEM_LIMIT = 0; // DEBUG

                    while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                        // free one of the first entries to allow previously obtained resources to 'live' longer
                        currAllocBytes -= pointers.front().second;
                        char* pnt = pointers.front().first;

                        // free memory
                        if (USE_NEW) {
                            delete[] pnt;
                        } else {
                            free(pnt);
                        }

                        // update array
                        pointers.pop_front();

                        if (DEBUG_ALLOCS) {
                            std::cout << "Id " << myId << " Free'd " << pointers.size()
                                      << " at " << (uint64_t) pnt << "\n";
                        }
                        frees++;
                    }

                    if (DEBUG_ALLOCS) {
                        std::cout << "Frees " << frees << ", " << currAllocBytes << "/"
                                  << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                    }
                }
            } // for each allocation

            if (currAllocBytes != 0) {
                std::cerr << "Not all free'd!\n";
            }

            std::cout << "Id " << myId << " done, total alloc'ed "
                      << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
        } // for each iteration

        exit(1);  // terminates right here; note the non-zero exit status
    }

    int main(int argc, char** argv)
    {
        runParallelAllocTest();
        return 0;
    }
Test System
From what I have seen so far, the hardware matters a lot. The test might need adjustments if it is run on a faster machine.

    Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
    Ubuntu 10.04 LTS 64 bit
    gcc 4.3, 4.4, 4.6
    3988.62 Bogomips
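Testing
Once you have run the makefile, you should get a file named ompmemtest. To query the memory usage over time, I used the following commands:

    ./ompmemtest &
    top -b | grep ompmemtest

This yields quite impressive fragmentation or leaking behaviour. The expected memory consumption with 4 threads is 1090 MB, which grew to 1500 MB over time: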

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    11626 byron     20   0  204m  99m 1000 R   27  2.5   0:00.81 ompmemtest
    11626 byron     20   0  992m 832m 1004 R  195 21.0   0:06.69 ompmemtest
    11626 byron     20   0 1118m 1.0g 1004 R  189 26.1   0:12.40 ompmemtest
    11626 byron     20   0 1218m 1.0g 1004 R  190 27.1   0:18.13 ompmemtest
    11626 byron     20   0 1282m 1.1g 1004 R  195 29.6   0:24.06 ompmemtest
    11626 byron     20   0 1471m 1.3g 1004 R  195 33.5   0:29.96 ompmemtest
    11626 byron     20   0 1469m 1.3g 1004 R  194 33.5   0:35.85 ompmemtest
    11626 byron     20   0 1469m 1.3g 1004 R  195 33.6   0:41.75 ompmemtest
    11626 byron     20   0 1636m 1.5g 1004 R  194 37.8   0:47.62 ompmemtest
    11626 byron     20   0 1660m 1.5g 1004 R  195 38.0   0:53.54 ompmemtest
    11626 byron     20   0 1669m 1.5g 1004 R  195 38.2   0:59.45 ompmemtest
    11626 byron     20   0 1664m 1.5g 1004 R  194 38.1   1:05.32 ompmemtest
    11626 byron     20   0 1724m 1.5g 1004 R  195 40.0   1:11.21 ompmemtest
    11626 byron     20   0 1724m 1.6g 1140 S  193 40.1   1:17.07 ompmemtest

Please note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).
OK, I picked up the bait.
This is on a system with

    Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
    4x5666.59 bogomips

    Linux meerkat 2.6.35-28-generic-pae #50-Ubuntu SMP Fri Mar 18 20:43:15 UTC 2011 i686 GNU/Linux

    gcc version 4.4.5

                 total       used       free     shared    buffers     cached
    Mem:       8127172    4220560    3906612          0     374328    2748796
    -/+ buffers/cache:    1097436    7029736
    Swap:            0          0          0

Naive run
I just ran it:

    time ./ompmemtest
    Id 0 about to release all memory: 258.144 MB
    Id 0 done, total alloc'ed -1572.7MB
    Id 3 about to release all memory: 257.854 MB
    Id 3 done, total alloc'ed -1569.6MB
    Id 1 about to release all memory: 257.339 MB
    Id 2 about to release all memory: 257.043 MB
    Id 1 done, total alloc'ed -1570.42MB
    Id 2 done, total alloc'ed -1569.96MB

    real    0m13.429s
    user    0m44.619s
    sys     0m6.000s

Nothing spectacular. Here is the simultaneous output of vmstat -S M 1:
Vmstat raw data

    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     0  0      0   3892    364   2669    0    0    24     0  701  1487  2  1 97  0
     4  0      0   3421    364   2669    0    0     0     0 1317  1953 53  7 40  0
     4  0      0   2858    364   2669    0    0     0     0 2715  5030 79 16  5  0
     4  0      0   2861    364   2669    0    0     0     0 6164 12637 76 15  9  0
     4  0      0   2853    364   2669    0    0     0     0 4845  8617 77 13 10  0
     4  0      0   2848    364   2669    0    0     0     0 3782  7084 79 13  8  0
     5  0      0   2842    364   2669    0    0     0     0 3723  6120 81 12  7  0
     4  0      0   2835    364   2669    0    0     0     0 3477  4943 84  9  7  0
     4  0      0   2834    364   2669    0    0     0     0 3273  4950 81 10  9  0
     5  0      0   2828    364   2669    0    0     0     0 3226  4812 84 11  6  0
     4  0      0   2823    364   2669    0    0     0     0 3250  4889 83 10  7  0
     4  0      0   2826    364   2669    0    0     0     0 3023  4353 85 10  6  0
     4  0      0   2817    364   2669    0    0     0     0 3176  4284 83 10  7  0
     4  0      0   2823    364   2669    0    0     0     0 3008  4063 84 10  6  0
     0  0      0   3893    364   2669    0    0     0     0 4023  4228 64 10 26  0

Does that information mean anything to you?
Google Thread Caching Malloc
Now for the real fun, add a little spice:

    time LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtest
    Id 1 about to release all memory: 257.339 MB
    Id 1 done, total alloc'ed -1570.42MB
    Id 3 about to release all memory: 257.854 MB
    Id 3 done, total alloc'ed -1569.6MB
    Id 2 about to release all memory: 257.043 MB
    Id 2 done, total alloc'ed -1569.96MB
    Id 0 about to release all memory: 258.144 MB
    Id 0 done, total alloc'ed -1572.7MB

    real    0m11.663s
    user    0m44.255s
    sys     0m1.028s

Looks faster, doesn't it?

    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     4  0      0   3562    364   2684    0    0     0     0 1041  1676  28  7 64  0
     4  2      0   2806    364   2684    0    0     0   172 1641  1843  84 14  1  0
     4  0      0   2758    364   2685    0    0     0     0 1520  1009  98  2  1  0
     4  0      0   2747    364   2685    0    0     0     0 1504   859  98  2  0  0
     5  0      0   2745    364   2685    0    0     0     0 1575  1073  98  2  0  0
     5  0      0   2739    364   2685    0    0     0     0 1415   743  99  1  0  0
     4  0      0   2738    364   2685    0    0     0     0 1526   981  99  2  0  0
     4  0      0   2731    364   2685    0    0     0   684 1536   927  98  2  0  0
     4  0      0   2730    364   2685    0    0     0     0 1584  1010  99  1  0  0
     5  0      0   2730    364   2685    0    0     0     0 1461   917  99  2  0  0
     4  0      0   2729    364   2685    0    0     0     0 1561  1036  99  1  0  0
     4  0      0   2729    364   2685    0    0     0     0 1406   756 100  1  0  0
     0  0      0   3819    364   2685    0    0     0     4 1159  1476  26  3 71  0

In case you want to compare the vmstat outputs.
Valgrind --tool massif
This is the head of the ms_print output after running valgrind --tool=massif ./ompmemtest (default malloc):

    --------------------------------------------------------------------------------
    Command:            ./ompmemtest
    Massif arguments:   (none)
    ms_print arguments: massif.out.beforetcmalloc
    --------------------------------------------------------------------------------

        GB
    1.009^ :
         |      ##::::@@:::::::@@::::::@@::::@@::@::::@::::@:::::::::@::::::@:::
         |      # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::
         |      # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::
         |     :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::
         |     :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::
         |     :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |    ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |    ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |    ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |    ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |    ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |  ::::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |  : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |  : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         | :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         | :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
         |::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
       0 +----------------------------------------------------------------------->Gi
         0                                                                   264.0

    Number of snapshots: 63
     Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]

Google HEAPPROFILE
Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftools:

    gcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtest

    time HEAPPROFILE=/tmp/heapprofile ./ompmemtest
    Starting tracking the heap
    Dumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use)
    Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use)
    Id 0 about to release all memory: 258.144 MB
    Id 0 done, total alloc'ed -1572.7MB
    Id 2 about to release all memory: 257.043 MB
    Id 2 done, total alloc'ed -1569.96MB
    Id 3 about to release all memory: 257.854 MB
    Id 3 done, total alloc'ed -1569.6MB
    Id 1 about to release all memory: 257.339 MB
    Id 1 done, total alloc'ed -1570.42MB
    Dumping heap profile to /tmp/heapprofile.0019.heap (Exiting)

    real    0m11.981s
    user    0m44.455s
    sys     0m1.124s

Contact me for the full logs and details.
Update
In response to the comments: I updated the program

    --- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200
    +++ q/openMpMemtest_Linux.cpp   2011-05-04 13:42:47.371726000 +0200
    @@ -13,8 +13,8 @@ void runParallelAllocTest()
     {
         // constants
    -    const int NUM_ALLOCATIONS = 5000; // alloc's per thread
    -    const int NUM_THREADS = 4;       // how many threads?
    +    const int NUM_ALLOCATIONS = 55000; // alloc's per thread
    +    const int NUM_THREADS = 8;        // how many threads?
         const int NUM_ITERS = NUM_THREADS;// how many overall repetions
         const bool USE_NEW = true;        // use new or malloc? , seems to make no difference (as it should)

It ran for over 5m3s. Close to the end, a screenshot of htop shows that the resident set (RES) is indeed slightly higher, approaching 2.3g:

      1  [||96.7%]     Tasks: 125 total, 2 running
      2  [||96.7%]     Load average: 8.09 5.24 2.37
      3  [||97.4%]     Uptime: 01:54:22
      4  [||96.1%]
      Mem[| 3055/7936MB]
      Swp[ 0/0MB]

      PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
     4330 sehe        8  20   0 2635M 2286M   908 R 368. 28.8 15:35.01 ./ompmemtest

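Comparing this with a tcmalloc run: 4m12s, and similar top statistics with minor differences. The big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process). The RES set is quite similar, if you ask me. The more important thing to note is that parallelism is increased; all cores are now maxed out. This is evidently due to the reduced need to lock for heap operations when using tcmalloc:
If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the application.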

      1  [100.0%]     Tasks: 172 total, 2 running
      2  [100.0%]     Load average: 7.39 2.92 1.11
      3  [100.0%]     Uptime: 11:12:25
      4  [100.0%]
      Mem[|| 3278/7936MB]
      Swp[ 0/0MB]

      PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
    14391 sehe        8  20   0 2251M 2179M  1148 R 379. 27.5  8:08.92 ./ompmemtest

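When the test program is linked against Google's tcmalloc library, the executable not only runs about 10% faster, it also shows greatly reduced or insignificant memory fragmentation: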

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    13441 byron     20   0  379m 334m 1220 R  187  8.4   0:02.63 ompmemtestgoogle
    13441 byron     20   0 1085m 1.0g 1220 R  194 26.2   0:08.52 ompmemtestgoogle
    13441 byron     20   0 1111m 1.0g 1220 R  195 26.9   0:14.42 ompmemtestgoogle
    13441 byron     20   0 1131m 1.1g 1220 R  195 27.4   0:20.30 ompmemtestgoogle
    13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:26.19 ompmemtestgoogle
    13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:32.05 ompmemtestgoogle
    13441 byron     20   0 1149m 1.1g 1220 R  191 27.9   0:37.81 ompmemtestgoogle
    13441 byron     20   0 1149m 1.1g 1220 R  194 27.9   0:43.66 ompmemtestgoogle
    13441 byron     20   0 1161m 1.1g 1220 R  188 28.2   0:49.32 ompmemtestgoogle
    13441 byron     20   0 1161m 1.1g 1220 R  194 28.2   0:55.15 ompmemtestgoogle
    13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:00.90 ompmemtestgoogle
    13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:06.64 ompmemtestgoogle
    13441 byron     20   0 1161m 1.1g 1356 R  192 28.2   1:12.42 ompmemtestgoogle

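From the data I have, the answer appears to be:
Multithreaded access to the heap can emphasize fragmentation if the heap library in use does not deal well with concurrent access and if the processor cannot execute the threads truly concurrently.
The tcmalloc library shows no significant memory fragmentation when running the same program that previously lost about 400 MB to fragmentation.
But why does that happen?
The best idea I have to offer here is some sort of locking artifact within the heap.
The test program allocates randomly sized blocks of memory and frees blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory that lives in a heap block on the "left", it might be halted because another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even look at that heap block on the "left" to check for free memory, as it is currently being changed. Hence it might end up unnecessarily using a new heap block from the "right".
This process could look like a heap block shift, where the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.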
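Let me restate that this fragmentation issue only occurs for me if I use 4 or more threads on a dual-core system, which can only run two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap are held for a short enough time that they do not block the other thread that wants to allocate memory.
Also, as a disclaimer, I did not check the actual code of the glibc heap implementation, nor am I anything more than a novice in the field of memory allocators. Everything I wrote is just how it appears to me, which makes it pure speculation.
Another interesting read might be the tcmalloc documentation, which describes common problems with heaps and multithreaded access, some of which may have played a role in the test program too.
It is worth noting that tcmalloc never returns memory to the system (see the Caveats paragraph in the tcmalloc documentation).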