12. Let's Build an MPI Cluster! - Getting the Himeno Benchmark to Run for Real This Time

This post continues from the previous one:
11. Let's Build an MPI Cluster! - Running the Himeno Benchmark - 電子計算記

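I started on the Himeno benchmark expecting it to be easy, but it has turned out to be fairly tricky.
When the CPU clock is variable, the measurement takes far too long and the measured values become questionable. I'll try two approaches to this problem.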

1) Adjust the measurement time
2) Run on a machine type that is less affected by clock capping

1) Adjust the measurement time

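The number of iterations (nn) is determined at line 133 of himenoBMTxps.c:
nn= (int)(target/(cpu/3.0));
Here cpu is the time in seconds taken by the 3 rehearsal iterations.
target is set to 60.0 at line 80, so the actual measurement is sized to run for roughly one minute.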
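To make the effect of target concrete, here is a minimal sketch of the same calculation as a standalone C function. The name planned_iterations and the sample rehearsal time used afterwards are illustrative assumptions and are not part of himenoBMTxps.c; only the formula comes from the line quoted above.

```c
#include <assert.h>

/* Hypothetical helper (not in himenoBMTxps.c). It reproduces
 *   nn = (int)(target / (cpu / 3.0));
 * where rehearsal_seconds is the time taken by the 3 rehearsal iterations. */
int planned_iterations(double target, double rehearsal_seconds)
{
    assert(rehearsal_seconds > 0.0);  /* guard against dividing by zero */
    double seconds_per_iteration = rehearsal_seconds / 3.0;
    return (int)(target / seconds_per_iteration);
}
```

For an illustrative rehearsal time of 0.07 s, planned_iterations(60.0, 0.07) returns 2571 and planned_iterations(20.0, 0.07) returns 857, so the planned loop count scales linearly with target. The estimate assumes every iteration runs at the rehearsal speed; if the clock is capped after the rehearsal, the real run takes proportionally longer than target seconds.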

The physical clock is 2.6 GHz and capping brings it down to 800 MHz, which is roughly a 3x difference, so I changed target to 20.0 and ran the benchmark again.

With 2 processes:

mpiuser@compute-1:/nfs/himeno$ time mpirun -np 2 --hostfile ~/my_hosts ./bmt 
Sequential version array size
 mimax = 129 mjmax = 129 mkmax = 257
Parallel version array size
 mimax = 129 mjmax = 129 mkmax = 131
imax = 128 jmax = 128 kmax =129
I-decomp = 1 J-decomp = 1 K-decomp =2
 Start rehearsal measurement process.
 Measure the performance in 3 times.

 MFLOPS: 6740.466122 time(s): 0.061022 1.667103e-03

 Now, start the actual measurement process.
 The loop will be excuted in 2949 times
 This will take about one minute.
 Wait for a while

cpu : 66.196050 sec.
Loop executed for 2949 times
Gosa : 3.322949e-04 
MFLOPS measured : 6107.963338
Score based on Pentium III 600MHz : 73.732054

real	1m7.043s
user	2m12.480s
sys	0m0.112s

Next, with 4 processes:

mpiuser@compute-1:/nfs/himeno$ time mpirun -np 4 --hostfile ~/my_hosts ./bmt 
Sequential version array size
 mimax = 129 mjmax = 129 mkmax = 257
Parallel version array size
 mimax = 129 mjmax = 67 mkmax = 131
imax = 128 jmax = 65 kmax =129
I-decomp = 1 J-decomp = 2 K-decomp =2
 Start rehearsal measurement process.
 Measure the performance in 3 times.

 MFLOPS: 10178.064077 time(s): 0.040412 1.702009e-03

 Now, start the actual measurement process.
 The loop will be excuted in 4454 times
 This will take about one minute.
 Wait for a while

cpu : 77.765181 sec.
Loop executed for 4454 times
Gosa : 1.873346e-04 
MFLOPS measured : 7852.695350
Score based on Pentium III 600MHz : 94.793522

real	1m18.557s
user	2m10.800s
sys	0m24.620s

So the measurement now finishes in a realistic amount of time.
The score has improved as well. To summarize:

2 processes

	target=60.0	target=20.0
MFLOPS	550	6107
real	10min	1min

4 processes

	target=60.0	target=20.0
MFLOPS	116	7852
real	10min	1min

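The score is higher and at first glance the result looks right. However, if the score changes with the run time, or rather with the number of iterations, then these results cannot be compared with others measured at target=60.0.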
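For reference, the reported figures can be cross-checked from the values in the logs. The sketch below assumes the benchmark counts 34 floating-point operations per updated grid point per iteration and divides MFLOPS by 82.84 to obtain the Pentium III score. Both constants are my reading of the benchmark and are consistent with the logs above, but treat them, and the helper names, as assumptions.

```c
#include <assert.h>

/* Floating-point operations per iteration for a grid whose "Sequential
 * version array size" is mimax x mjmax x mkmax. Assumes 34 operations per
 * updated point and (dim - 3) updated points along each axis. */
double flop_per_iteration(int mimax, int mjmax, int mkmax)
{
    assert(mimax > 3 && mjmax > 3 && mkmax > 3);
    return (double)(mimax - 3) * (double)(mjmax - 3) * (double)(mkmax - 3) * 34.0;
}

/* MFLOPS = operations per iteration * iterations / seconds / 1e6. */
double mflops(double flop, int iterations, double seconds)
{
    assert(seconds > 0.0);
    return flop * (double)iterations / seconds * 1.0e-6;
}

/* Score relative to an assumed Pentium III 600MHz baseline of 82.84 MFLOPS. */
double pentium3_score(double mflops_value)
{
    return mflops_value / 82.84;
}
```

Plugging in the 2-process log (array size 129 x 129 x 257, 2949 loops, 66.196050 s) gives about 6108 MFLOPS and a score of about 73.7, in line with the printed values. Because MFLOPS is already normalized by the loop count and the elapsed time, a different figure at a different target is consistent with the clock speed varying over the course of a run.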

2) Run on a machine type that is less affected by clock capping

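To measure without modifying the code, the only option is an environment where clock capping is absent or has almost no effect. On the HighCPU types the physical CPU runs at 2.6 GHz and the spec is also 2.6 GHz, so that should work.
I therefore resized the machines from Light.S1 to HighCPU.M4 (2 vCPUs, 4 GB of memory) and measured again.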

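This time the measurement completed without trouble, so I varied the number of processes for each problem size and plotted the results. I prepared up to 16 machines, so at -np 32 each machine runs 2 processes (these machines have 2 vCPUs each).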

[Figure: Himeno benchmark score versus number of MPI processes, one series per problem size]

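As the number of processes grows, the score tends to fall, and the smaller the problem size, the more pronounced the effect. My guess is that with a smaller problem there is less computation per process, so the MPI communication overhead accounts for a larger and larger share of the run.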
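The surface-to-volume argument behind this guess can be sketched numerically. The model below is an assumption for illustration and is not code from the benchmark: it splits an nx x ny x nz grid evenly over px x py x pz processes and compares the cells the busiest process exchanges with its neighbors (one layer per shared face) against the cells it computes.

```c
#include <assert.h>

/* Faces shared with neighbors along one axis for the busiest process:
 * none if the axis is not split, 1 if it is split in two, otherwise 2. */
static int neighbor_faces(int parts)
{
    return parts > 2 ? 2 : parts - 1;
}

/* Ratio of exchanged halo cells to computed cells per process.
 * Assumes each dimension divides evenly by its process count. */
double halo_ratio(int nx, int ny, int nz, int px, int py, int pz)
{
    assert(px > 0 && py > 0 && pz > 0);
    assert(nx % px == 0 && ny % py == 0 && nz % pz == 0);
    double lx = nx / px, ly = ny / py, lz = nz / pz;  /* local block size */
    double halo = neighbor_faces(px) * ly * lz
                + neighbor_faces(py) * lx * lz
                + neighbor_faces(pz) * lx * ly;
    return halo / (lx * ly * lz);
}
```

With a 1 x 1 x 2 split, a 128 x 128 x 256 grid gives a ratio of 1/128 while a 64 x 64 x 128 grid gives 1/64, so halving the grid doubles the relative amount of communication. Splitting the 128 x 128 x 256 grid 2 x 2 x 2 raises the ratio to 5/128. The model ignores latency and message counts, so it only indicates the trend.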

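Incidentally, at XL size the runs with 4 processes or fewer could not be measured because there was not enough memory.
Also, if you compile for XL size as-is, the link step fails with "relocation truncated to fit" errors like the following.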

mpiuser@compute-1:/nfs/himeno$ make clean; make
/bin/rm -f bmt himenoBMTxps.o core
mpicc -c -O3 himenoBMTxps.c
himenoBMTxps.c: In function ‘initcomm’:
himenoBMTxps.c:292:5: warning: implicit declaration of function ‘exit’ [-Wimplicit-function-declaration]
     exit(0);
     ^
himenoBMTxps.c:292:5: warning: incompatible implicit declaration of built-in function ‘exit’
himenoBMTxps.c:292:5: note: include ‘<stdlib.h>’ or provide a declaration of ‘exit’
mpicc -o bmt himenoBMTxps.o -O3
himenoBMTxps.o: In function `initmt':
himenoBMTxps.c:(.text+0x93): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0x9a): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xa8): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xb3): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xd5): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xdc): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xe3): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xee): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0x10c): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0x113): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0x11a): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status
Makefile:13: recipe for target 'bmt' failed
make: *** [bmt] Error 1

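The error appears because the statically allocated arrays no longer fit within the 2 GiB that the default small code model can reach with 32-bit relocations. In that case, add -mcmodel=large at line 7 of the Makefile and run make again: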
CFLAGS = -O3 -mcmodel=large
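A rough size estimate shows how quickly the static data grows. The sketch below assumes the benchmark keeps 14 statically allocated float arrays of the printed array size (p, a[4], b[3], c[3], bnd, wrk1, wrk2). The array count and the helper names are my assumptions, so treat the totals as estimates.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated bytes of static data for one process, assuming 14 float
 * arrays of mimax x mjmax x mkmax elements each. */
uint64_t static_array_bytes(uint64_t mimax, uint64_t mjmax, uint64_t mkmax)
{
    assert(mimax > 0 && mjmax > 0 && mkmax > 0);
    const uint64_t arrays = 14;
    return arrays * sizeof(float) * mimax * mjmax * mkmax;
}

/* The small code model addresses static data through signed 32-bit
 * relocations, so data at or beyond 2 GiB is out of reach. */
int exceeds_small_code_model(uint64_t bytes)
{
    return bytes >= (UINT64_C(1) << 31);
}
```

For the 2-process build shown earlier (129 x 129 x 131) this comes to about 122 MB, well under the limit. An undivided XL grid, which I take to be 513 x 513 x 1025, comes to about 15.1 GB. Splitting the grid across more processes shrinks the per-process arrays, which fits with the smaller process counts being the ones that ran out of memory.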

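So it turns out that the Himeno benchmark is a poor match for the IDCF Cloud machine types that are subject to clock capping. Or is it? I'm a little unsure, so I'll check a bit further. This time I used the C code, so next time I'll try the Fortran implementation of the Himeno benchmark as well.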
