2018-02-01

17. Let's Build an MPI Cluster! - Trying Hybrid Parallelism with STREAM

This continues from the previous post.
16. Let's Build an MPI Cluster! - Examining HPL Parameters - 電子計算記

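I had forgotten, but I started this series to fill slots in an Advent calendar, so this is the final installment. It ended up taking more than a month, though.
This time we run STREAM. It is a standard benchmark both inside and outside HPC, it is simple, and many implementations exist, which makes it very easy to use.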

www.cs.virginia.edu

STREAM is commonly used to benchmark memory bandwidth. An MPI implementation is also available, so that is what we try this time.

Building it is easy.

mpiuser@compute-1:~$ mkdir /nfs/stream
mpiuser@compute-1:~$ cd /nfs/stream/
mpiuser@compute-1:/nfs/stream$
mpiuser@compute-1:/nfs/stream$ wget http://www.cs.virginia.edu/stream/FTP/Code/Versions/stream_mpi.c
mpiuser@compute-1:/nfs/stream$ mpicc -O3 stream_mpi.c

Running it works the same way as before.

mpiuser@compute-1:/nfs/stream$ mpirun -np 4 --hostfile ~/my_hosts ./a.out
-------------------------------------------------------------
STREAM version $Revision: 1.8 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Total Aggregate Array size = 10000000 (elements)
Total Aggregate Memory per array = 76.3 MiB (= 0.1 GiB).
Total Aggregate memory required = 228.9 MiB (= 0.2 GiB).
Data is distributed across 4 MPI ranks
   Array size per MPI rank = 2500000 (elements)
   Memory per array per MPI rank = 19.1 MiB (= 0.0 GiB).
   Total memory per MPI rank = 57.2 MiB (= 0.1 GiB).
-------------------------------------------------------------
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
The SCALAR value used for this run is 0.420000
-------------------------------------------------------------
Your timer granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2361 microseconds.
   (= 2361 timer ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 timer ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          43360.4     0.004048     0.003690     0.005795
Scale:         42361.4     0.004132     0.003777     0.005845
Add:           40603.1     0.006053     0.005911     0.006244
Triad:         38144.5     0.006475     0.006292     0.006926
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

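The final scores are in the Best Rate MB/s column, and four kinds of tests are run. Of the four, Triad is the value most often quoted as the benchmark figure. It performs both an addition and a multiplication, so it is the most complex computation of the four tests.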
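As a rough illustration of how the Best Rate column relates to the Min time column, here is a minimal Python sketch. It assumes the usual STREAM accounting, in which Copy and Scale touch two arrays and Add and Triad touch three, with the 8-byte elements reported in the log. The function name and table are hypothetical helpers, not part of STREAM itself.

```python
# Number of arrays each STREAM kernel reads or writes (assumed accounting).
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}
BYTES_PER_ELEMENT = 8  # "This system uses 8 bytes per array element."


def best_rate_mb_s(kernel: str, array_elements: int, min_time_s: float) -> float:
    """Bandwidth in MB/s (1 MB = 1e6 bytes) derived from the minimum time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * array_elements
    return bytes_moved / min_time_s / 1.0e6


if __name__ == "__main__":
    # Min times copied from the 4-rank log above (10,000,000 elements in total).
    for kernel, min_time in [("Copy", 0.003690), ("Triad", 0.006292)]:
        rate = best_rate_mb_s(kernel, 10_000_000, min_time)
        print(f"{kernel}: {rate:.1f} MB/s")
```

The values come out close to the logged rates. Small differences are expected because the log prints the minimum time rounded to six decimal places.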

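In this example the array size is 10M elements, and the cluster as a whole needs 0.2 GiB of memory. STREAM results depend on memory bandwidth, so if the array size is small, that is, if the memory used by the computation is small, the data fits in the CPU cache rather than main memory and the result comes out faster than the memory's true bandwidth.
For that reason, you need to set a large value that matches the installed memory of the machine or the total memory of the cluster. This can be specified with a compile-time option as shown below. The example uses 100M, 10 times the default.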

mpiuser@compute-1:/nfs/stream$ mpicc -O3 -DSTREAM_ARRAY_SIZE=100000000 stream_mpi.c

Running this requires 2.2 GiB across the whole cluster.
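The memory figure follows directly from the array size. The sketch below assumes three arrays of 8-byte elements, as reported in the STREAM log. The helper name is made up for illustration.

```python
def stream_footprint_mib(array_elements: int, ranks: int = 1,
                         bytes_per_element: int = 8, arrays: int = 3) -> float:
    """Memory needed for the STREAM arrays in MiB, optionally per MPI rank."""
    total_bytes = array_elements * bytes_per_element * arrays
    return total_bytes / ranks / 2**20


if __name__ == "__main__":
    for elements in (10_000_000, 100_000_000):
        mib = stream_footprint_mib(elements)
        print(f"{elements} elements: {mib:.1f} MiB = {mib / 1024:.1f} GiB")
```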
Now let's run it while varying the array size.

Light.S1 results
f:id:fujish:20180131012528p:plain

HighCPU.M4 results
f:id:fujish:20180131012545p:plain

The S1 scores are not much different from the M4 scores. This is probably because most tests finish in 1 to 2 seconds, so they complete before clock capping kicks in.

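Also, on both S1 and M4 the results suddenly become faster at -np 16 and above. As the number of nodes grows, the memory needed per process shrinks and the data fits in the CPU cache. With an array size of 10M at -np 16, each process needs 14.3 MiB, and the physical CPUs in this environment have a 40 MiB L3 cache, so the data fits comfortably. That said, the environment is virtualized and the cache is shared with other virtual machines, so there is no telling whether the data fits entirely in cache, and it varies from moment to moment.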
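A crude way to see when the per-process working set drops below the L3 size is sketched here. It assumes three 8-byte arrays and takes the 40 MiB L3 figure from the text. Since the cache is shared with other ranks and other virtual machines, treat "fits" as a necessary condition at best, not a prediction.

```python
L3_CACHE_MIB = 40.0  # figure quoted in the text; shared, so only an upper bound


def per_rank_mib(array_elements: int, ranks: int) -> float:
    """Working set per MPI rank in MiB (3 arrays of 8-byte elements assumed)."""
    return array_elements * 8 * 3 / ranks / 2**20


def fits_in_l3(array_elements: int, ranks: int,
               cache_mib: float = L3_CACHE_MIB) -> bool:
    """True if one rank's arrays are no larger than the whole L3 cache."""
    return per_rank_mib(array_elements, ranks) <= cache_mib


if __name__ == "__main__":
    for ranks in (1, 2, 4, 8, 16, 32):
        size = per_rank_mib(10_000_000, ranks)
        print(f"-np {ranks:2d}: {size:6.1f} MiB per rank, "
              f"fits in L3: {fits_in_l3(10_000_000, ranks)}")
```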

So, as long as you can specify an array size that does not fit in cache at all, the array size appears to have almost no effect on the score.

Hybrid parallelization

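Now that we have a rough feel for STREAM, here comes the main topic: hybrid parallelism.
So far we have used MPI as the parallelization implementation. MPI parallelizes by means of processes.
Within a node on a multicore machine, on the other hand, thread-based parallelization is often used because it is lightweight and fast, and OpenMP is the standard implementation.
Hybrid parallelization uses both: MPI between nodes and OpenMP within a node.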

In fact, the MPI version of STREAM supports this MPI plus OpenMP hybrid parallelism, so it is easy to try with the same code as above.

To compile, just add -fopenmp, the option that enables OpenMP. The OpenMP library package was installed along with gcc and the rest when Open MPI was installed.

mpiuser@compute-1:/nfs/stream$ mpicc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream_mpi.c

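Running it requires additional options. Adding -bind-to board lets the OpenMP threads use each CPU core. The board part has to match the machine environment. On the IDCF Cloud environment or on a single-socket machine, specify board. Environment variables can be set with -x, as in -x OMP_NUM_THREADS=2, so specify the number of OpenMP threads as well. Here it is 2 because HighCPU.M4 has 2 vCPUs.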

mpiuser@compute-1:/nfs/stream$ mpirun -np 16 -bind-to board -x OMP_NUM_THREADS=2 --hostfile ~/my_hosts ./a.out
-------------------------------------------------------------
STREAM version $Revision: 1.8 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Total Aggregate Array size = 100000000 (elements)
Total Aggregate Memory per array = 762.9 MiB (= 0.7 GiB).
Total Aggregate memory required = 2288.8 MiB (= 2.2 GiB).
Data is distributed across 16 MPI ranks
Array size per MPI rank = 6250000 (elements)
Memory per array per MPI rank = 47.7 MiB (= 0.0 GiB).
Total memory per MPI rank = 143.1 MiB (= 0.1 GiB).
-------------------------------------------------------------
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
The SCALAR value used for this run is 0.420000
-------------------------------------------------------------
Number of Threads requested for each MPI rank = 2
Number of Threads counted for rank 0 = 2
-------------------------------------------------------------
Your timer granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2944 microseconds.
(= 2944 timer ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 timer ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 203224.7 0.009135 0.007873 0.013324
Scale: 156833.1 0.010802 0.010202 0.014634
Add: 175671.5 0.014061 0.013662 0.016308
Triad: 176416.6 0.013954 0.013604 0.014882
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

> Number of Threads requested for each MPI rank = 2
> Number of Threads counted for rank 0 = 2
These lines are about the only difference. The rest is the same.

So how do the results differ between MPI only and hybrid? Let's set the array size to 100M and run on a row of HighCPU.M4 machines.

f:id:fujish:20180201005721p:plain

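With Hybrid at np 2 the parallelism including threads is 4, so Hybrid np 2 should be compared with MPI Only np 4, which has the same parallelism. The values are roughly the same, with Hybrid scoring slightly lower.
On the other hand, for Hybrid np 16 and MPI Only np 32, which use all of the cluster's CPUs, Hybrid scores slightly higher.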
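The pairing used in this comparison is simply ranks times threads. A small sketch, with hypothetical helper names, makes the matching explicit:

```python
def total_parallelism(np_ranks: int, omp_threads: int = 1) -> int:
    """Number of cores kept busy: MPI ranks times OpenMP threads per rank."""
    return np_ranks * omp_threads


def equivalent_mpi_only_np(hybrid_np: int, omp_threads: int) -> int:
    """MPI-only -np value that uses the same number of cores as a hybrid run."""
    return total_parallelism(hybrid_np, omp_threads)


if __name__ == "__main__":
    # OMP_NUM_THREADS=2 as in the run above (2 vCPUs per HighCPU.M4 node).
    for hybrid_np in (2, 4, 8, 16):
        print(f"Hybrid np {hybrid_np:2d} <-> MPI Only np "
              f"{equivalent_mpi_only_np(hybrid_np, 2)}")
```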

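Judging from the results, if you run only one process per node MPI Only is better, but if you run as many processes per node as there are cores, using threads appears to be faster.
The difference was small this time because the machines had 2 vCPUs, but with more cores or a larger problem size the difference would presumably grow much larger.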

And that is how easily hybrid parallelism can be tried.

* On HighCPU.M4 (2 vCPUs), runs up to -np 16 used slots=1, so only one process ran on each node.

Summary

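As we have seen throughout this series, lining up IDCF Cloud Light.S1 instances let us build an MPI cluster environment cheaply. On the other hand, the effect of clock capping on performance and run time has to be taken into account. Also, a quick resize to a higher instance type gives you a hybrid parallel environment as well, so this setup could be useful for developing MPI code.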

2018-01-29

16. Let's Build an MPI Cluster! - Examining HPL Parameters

This continues from the previous post.
15. Let's Build an MPI Cluster! - Running HPL - 電子計算記

The test run last time gave 4.915e-02 GFLOPS, that is, about 0.05 GFLOPS, which is quite slow even for a single S1 node.
Raising the score from here requires tuning the parameters.

Specifically, you edit HPL.dat, run, and repeat, searching for the maximum score.
The meaning and usage of the parameters are described on the official tuning page.

HPL Tuning

Guidance on what values work well is on the official FAQ page.

HPL Frequently Asked Questions

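Reading them did not make things much clearer, though, so the only option is to experiment. It takes time, but we just keep running it.
Here we look at the minimum set of parameters that have a large effect on the score: N, NB, P, and Q.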

N

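N is the problem size and affects the score the most. It basically depends on the available memory: the larger it is, the higher the score, and the longer the computation takes.
The FAQ says 10000 for 1 GB, so we start at 1000 and increase it step by step while watching the score.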
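The relationship between memory and N can be sketched as follows. HPL stores an N x N matrix of 8-byte doubles, so N is bounded by the square root of the usable memory divided by 8. The 80% fraction is a commonly quoted rule of thumb and is an assumption here, as is the 4 x 4 GiB cluster used in the example. The function name is hypothetical.

```python
import math
from typing import Optional


def suggest_n(total_mem_bytes: int, fraction: float = 0.8,
              nb: Optional[int] = None) -> int:
    """Largest N whose N x N double matrix fits in `fraction` of memory.

    If nb is given, round N down to a multiple of the block size.
    """
    n = int(math.sqrt(fraction * total_mem_bytes / 8))
    if nb:
        n -= n % nb
    return n


if __name__ == "__main__":
    gib = 2**30
    print("1 GiB, all memory  :", suggest_n(1 * gib, fraction=1.0))
    print("16 GiB, 80%        :", suggest_n(16 * gib))
    print("16 GiB, 80%, NB=256:", suggest_n(16 * gib, nb=256))
```

This only gives a starting point. The N that actually fits also depends on what the OS and other processes use, so it still has to be confirmed by running.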

Here is an example HPL.dat for N=10000. NB=64, P=1, and Q=4 are fixed, and the run used 4 Light.S1 nodes.

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10000        Ns
1            # of NBs
64           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
4            Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

Nothing special is needed to run it.

mpiuser@compute-1:/nfs/hpl-2.2/bin/Linux_PII_CBLAS_gm$ mpirun -np 4 --hostfile ~/my_hosts ./xhpl

Here are the results. A single run produces results for 18 tests, and I used the highest score among them.

f:id:fujish:20180129002340p:plain
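The figure of 18 tests per run comes from HPL trying every combination of the values listed in HPL.dat. A minimal sketch of that count, using the lists from the HPL.dat shown above (the function itself is a hypothetical helper):

```python
from math import prod


def hpl_test_count(*parameter_lists) -> int:
    """Number of runs HPL performs: the product of the list lengths."""
    return prod(len(values) for values in parameter_lists)


if __name__ == "__main__":
    count = hpl_test_count(
        [10000],      # Ns
        [64],         # NBs
        [(1, 4)],     # process grids (P, Q)
        [0, 1, 2],    # PFACTs
        [2, 4],       # NBMINs
        [2],          # NDIVs
        [0, 1, 2],    # RFACTs
        [0],          # BCASTs
        [0],          # DEPTHs
    )
    print(count)
```

Trimming PFACTs, NBMINs, and RFACTs to a single value each would cut a run to one test, which matters when a single test takes hours.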

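N=20000 could not be run because of insufficient memory.
The score should grow as N grows, but for some reason N=1000 and N=2000 scored better.
With N=1000 the run finishes in a few seconds, but with N=19000 it takes more than 5 hours.
As in earlier posts, this is probably the effect of clock capping.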

So let's run the same thing on 4 HighCPU.M4 nodes, which are less affected by clock capping. With slots=1, 4 processes run on 4 nodes.

f:id:fujish:20180129003200p:plain

The results rise steadily, and a larger N appears to be better.
Incidentally, N=45000 ran out of memory, and N=40000 took a little under 3 hours to compute.

The best scores so far are:
Light.S1: 4.77e+00 GFLOPS at N=19000
HighCPU.M4: 8.14e+01 GFLOPS at N=40000

NB

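NB is the block size. This value must be neither too large nor too small.
According to the FAQ, a value between 32 and 256 looks good, so we run with different NB values.
Based on the results above, N is set to 19000 for S1 and 40000 for M4.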

First, the Light.S1 results.

f:id:fujish:20180129004735p:plain

NB=256 was the best. NB=512 could not be run because of insufficient memory.

Next, the HighCPU.M4 results.

f:id:fujish:20180129004833p:plain

NB=256 was the best.

So the best scores so far are:
Light.S1: 8.72E+00 GFLOPS at N=19000, NB=256
HighCPU.M4: 1.03e+02 GFLOPS at N=40000, NB=256

P and Q

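Finally, P and Q. These are the dimensions of the process grid, so P*Q has to match the np value passed to mpirun.
We have 4 nodes this time, so three patterns are possible: 1*4, 2*2, and 4*1.
According to the FAQ, a flat grid with the larger value in Q, such as 1*4, 1*8, or 2*4, works well.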
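Enumerating the candidate grids for a given np is a matter of listing factor pairs. The sketch below uses hypothetical helper names and also filters for the flat P <= Q shapes that the FAQ favours:

```python
def process_grids(np_ranks: int) -> list[tuple[int, int]]:
    """All (P, Q) pairs with P * Q == np_ranks, ordered by increasing P."""
    return [(p, np_ranks // p)
            for p in range(1, np_ranks + 1)
            if np_ranks % p == 0]


def flat_grids(np_ranks: int) -> list[tuple[int, int]]:
    """Only the grids with P <= Q."""
    return [(p, q) for p, q in process_grids(np_ranks) if p <= q]


if __name__ == "__main__":
    print(process_grids(4))
    print(flat_grids(8))
```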

Light.S1 results.

f:id:fujish:20180129005743p:plain

HighCPU.M4 results.

f:id:fujish:20180129005825p:plain

In both cases P*Q=1*4 was the best.

The final results are unchanged. Running on 4 nodes with -np 4:
Light.S1: 8.72E+00 GFLOPS at N=19000, NB=256
HighCPU.M4: 1.03e+02 GFLOPS at N=40000, NB=256

* On HighCPU.M4 (2 vCPUs), the 4-node runs with -np 4 used slots=1, so only one process ran on each node.

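That wraps up the HPL posts. Between parameter tuning and the choice of BLAS package, there is no end to it if you get serious. Getting a TOP500 score must be really hard work.
The next post fills the last slot, and I am thinking of finishing with another standard, STREAM.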

fujish.hateblo.jp

2018-01-11

15. Let's Build an MPI Cluster! - Running HPL

Continuing from the previous post.
14. Let's Build an MPI Cluster! - Running qn24b - 電子計算記

Speaking of supercomputers, TOP500!

Home | TOP500 Supercomputer Sites

Speaking of TOP500, Linpack!

LINPACK - Wikipedia

So let's run HPL, the MPI-parallel implementation of Linpack, and produce a TOP500-style benchmark score.

If you get serious there is no end to how deep this goes, so I will show a quick and easy way to run it.

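HPL uses BLAS as the library for matrix operations, the core of Linpack, so we prepare that first.
If you really want performance you would build it for your own environment, but to save time we use the standard Ubuntu package.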

Among recent open source BLAS implementations, OpenBLAS and ATLAS are the standard choices. Ubuntu 16.04 has packages for both, so either works. Here we proceed with OpenBLAS.

First, using compute-1 as the build environment, install OpenBLAS.

root@compute-1:~# apt install libopenblas-dev -y

This alone installs a ready-to-use OpenBLAS library under /usr/lib.

With the groundwork done, we build HPL. First download the files from the official site.

mpiuser@compute-1:~$ wget http://www.netlib.org/benchmark/hpl/hpl-2.2.tar.gz
mpiuser@compute-1:~$ tar zxf hpl-2.2.tar.gz -C /nfs/
mpiuser@compute-1:~$ cd /nfs/hpl-2.2/

Sample Makefiles for various environments are in the setup directory. Here we edit Make.Linux_PII_CBLAS_gm, which needs the fewest changes, as the base.

mpiuser@compute-1:/nfs/hpl-2.2$ cp ./setup/Make.Linux_PII_CBLAS_gm ./
mpiuser@compute-1:/nfs/hpl-2.2$ vi Make.Linux_PII_CBLAS_gm

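The edits are:
Line 70: set TOPdir to the directory path where HPL was extracted
TOPdir = /nfs/hpl-2.2
Line 95: set LAdir to the directory path of the library where BLAS was installed
LAdir = /usr/lib
Line 97: set LAlib to the BLAS library file itself
LAlib = $(LAdir)/libopenblas.a
That is all. (This assumes Open MPI and gfortran have been installed with apt in the earlier posts.)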

It may be hard to follow, so here is the whole file.
/nfs/hpl-2.2/Make.Linux_PII_CBLAS_gm

#  
#  -- High Performance Computing Linpack Benchmark (HPL)                
#     HPL - 2.2 - February 24, 2016                          
#     Antoine P. Petitet                                                
#     University of Tennessee, Knoxville                                
#     Innovative Computing Laboratory                                 
#     (C) Copyright 2000-2008 All Rights Reserved                       
#                                                                       
#  -- Copyright notice and Licensing terms:                             
#                                                                       
#  Redistribution  and  use in  source and binary forms, with or without
#  modification, are  permitted provided  that the following  conditions
#  are met:                                                             
#                                                                       
#  1. Redistributions  of  source  code  must retain the above copyright
#  notice, this list of conditions and the following disclaimer.        
#                                                                       
#  2. Redistributions in binary form must reproduce  the above copyright
#  notice, this list of conditions,  and the following disclaimer in the
#  documentation and/or other materials provided with the distribution. 
#                                                                       
#  3. All  advertising  materials  mentioning  features  or  use of this
#  software must display the following acknowledgement:                 
#  This  product  includes  software  developed  at  the  University  of
#  Tennessee, Knoxville, Innovative Computing Laboratory.             
#                                                                       
#  4. The name of the  University,  the name of the  Laboratory,  or the
#  names  of  its  contributors  may  not  be used to endorse or promote
#  products  derived   from   this  software  without  specific  written
#  permission.                                                          
#                                                                       
#  -- Disclaimer:                                                       
#                                                                       
#  THIS  SOFTWARE  IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
#  ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,  INCLUDING,  BUT NOT
#  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
#  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
#  OR  CONTRIBUTORS  BE  LIABLE FOR ANY  DIRECT,  INDIRECT,  INCIDENTAL,
#  SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL DAMAGES  (INCLUDING,  BUT NOT
#  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
#  DATA OR PROFITS; OR BUSINESS INTERRUPTION)  HOWEVER CAUSED AND ON ANY
#  THEORY OF LIABILITY, WHETHER IN CONTRACT,  STRICT LIABILITY,  OR TORT
#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
#  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
# ######################################################################
#  
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s
MKDIR        = mkdir
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = Linux_PII_CBLAS_gm
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = /nfs/hpl-2.2
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a 
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        =
MPinc        =
MPlib        =
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /usr/lib
LAinc        =
LAlib        = $(LAdir)/libopenblas.a
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      =
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_CALL_VSIPL       call the vsip  library;
# -DHPL_DETAILED_TIMING  enable detailed timers;
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
#
HPL_OPTS     = -DHPL_CALL_CBLAS
#
# ----------------------------------------------------------------------
#
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC           = mpicc
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = mpif77
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------

Once the Makefile is ready, all that is left is to run make.

mpiuser@compute-1:/nfs/hpl-2.2$ make arch=Linux_PII_CBLAS_gm
mpiuser@compute-1:/nfs/hpl-2.2$ cd bin/Linux_PII_CBLAS_gm/
mpiuser@compute-1:/nfs/hpl-2.2/bin/Linux_PII_CBLAS_gm$ ls -alh
total 22M
drwxrwxr-x 2 mpiuser mpiuser 4.0K Jan 10 00:35 .
drwxrwxr-x 3 mpiuser mpiuser 4.0K Jan 10 00:35 ..
-rw-r--r-- 1 mpiuser mpiuser 1.2K Jan 10 00:35 HPL.dat
-rwxrwxr-x 1 mpiuser mpiuser  22M Jan 10 00:35 xhpl

If the build succeeds, a binary called xhpl is created in bin/(arch)/.
Finally, let's test it as described in the README.

mpiuser@compute-1:/nfs/hpl-2.2/bin/Linux_PII_CBLAS_gm$ mpirun -np 4 ./xhpl
(output omitted)
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00R2R2          35     4     4     1               0.00              4.775e-02
HPL_pdgesv() start time Wed Jan 10 00:38:25 2018

HPL_pdgesv() end time   Wed Jan 10 00:38:25 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0247304 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00R2R4          35     4     4     1               0.00              4.915e-02
HPL_pdgesv() start time Wed Jan 10 00:38:25 2018

HPL_pdgesv() end time   Wed Jan 10 00:38:25 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0199397 ...... PASSED
================================================================================

Finished    864 tests with the following results:
            864 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

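The test requires at least 4 processes, so first run it within a single node.
If there are no failed or skipped tests, it succeeded.
Even on Light.S1 it should finish in about 1 second.
However, running across multiple nodes on 4 Light.S1 machines, the same test takes more than 30 minutes to finish.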

So next time we will get into tuning HPL's run parameters.

fujish.hateblo.jp

2018-01-09

14. Let's Build an MPI Cluster! - Running qn24b

This continues from the previous post.
13. Let's Build an MPI Cluster! - Running the Himeno Benchmark a Bit More - 電子計算記

The Himeno benchmark in the previous posts was a bit tricky, so let's run a simpler benchmark where it is easy to see the score scale as nodes are added.

So this time it is qn24b. For what the N-queens problem is, see the link.
N-queens

It may not be very well known, but I have been a fan for more than 10 years, because:

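・The code is simple and easy to understand
・There are several implementations: sequential, OpenMP, and MPI
・It shows different approaches: the OpenMP version parallelizes by simple partitioning, and the MPI version uses a Master-Worker model
・After parallelization there is almost no data exchange between processes or threads, so it scales easily
→ This makes it ideal material for learning parallel and distributed programming.
　Whenever a new parallel computing environment comes out, I start by porting qn24b to it.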

It is also distinctive as benchmark software: while most benchmark tools of this kind measure floating-point performance, qn24b is mainly integer arithmetic.
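To give a feel for why this workload is all integer operations, here is a minimal Python sketch that counts N-queens solutions with bitmasks. It is the textbook bitmask backtracking approach, shown for illustration only, and is not taken from the qn24b source.

```python
def count_queens(n: int) -> int:
    """Count the solutions to the N-queens problem using integer bitmasks."""
    full = (1 << n) - 1  # n low bits set: every column occupied

    def place(cols: int, diag_l: int, diag_r: int) -> int:
        if cols == full:          # a queen has been placed in every row
            return 1
        total = 0
        free = full & ~(cols | diag_l | diag_r)  # safe columns in this row
        while free:
            bit = free & -free    # lowest free column
            free ^= bit
            total += place(cols | bit,
                           ((diag_l | bit) << 1) & full,
                           (diag_r | bit) >> 1)
        return total

    return place(0, 0, 0)


if __name__ == "__main__":
    for n in range(4, 9):
        print(n, count_queens(n))
```

Each first-row placement starts an independent subtree, which is why the work splits across workers with almost no communication.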

Let's get started.

mpiuser@compute-1:~$ wget http://www.arch.cs.titech.ac.jp/~kise/nq/package/qn24b-version1.0.tgz
mpiuser@compute-1:~$ mkdir /nfs/qn24b
mpiuser@compute-1:~$ tar zxf qn24b-version1.0.tgz -C /nfs/qn24b
mpiuser@compute-1:~$ cd /nfs/qn24b/version1.0/mpi/

Next we would like to run make, but one fix is needed first.
Remove the -static option on line 9 of the Makefile.
mpicc -Wall -O2 $(SRC) -o $(TRG)
After that, just run make.

mpiuser@compute-1:/nfs/qn24b/version1.0/mpi$ make

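To run it, you need to pass the problem size (the N in the N-queens problem). The larger N is, the more computation there is and the longer it takes.
For example, to run the 16-queens problem with 4 processes: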

mpiuser@compute-1:/nfs/qn24b/version1.0/mpi$ mpirun -np 4 --hostfile ~/my_hosts ./qn24b_mpi 16
(output omitted)
003 : 09839 09844 0000000000001050 099.95 00000.00 Tue Jan  9 00:02:14 2018
002 : 09841 09844 0000000000000944 099.96 00000.00 Tue Jan  9 00:02:14 2018
001 : 09840 09844 0000000000001160 099.97 00000.00 Tue Jan  9 00:02:14 2018
003 : 09842 09844 0000000000000994 099.98 00000.00 Tue Jan  9 00:02:14 2018
002 : 09843 09844 0000000000000878 099.99 00000.00 Tue Jan  9 00:02:14 2018
001 : 09844 09844 0000000000000650 100.00 00000.00 Tue Jan  9 00:02:14 2018
=============================================
qn24b MPI version 1.0.0 2004-06-16
problem size n        : 16
total   solutions     : 14772512
correct solutions     : 14772512
million solutions/sec : 6.018
elapsed time (sec)    : 2.455
=============================================

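If total solutions and correct solutions match, the computation succeeded.
The benchmark score is either million solutions/sec, the number processed per second (larger is faster), or elapsed time (sec), the processing time (smaller is faster).
Because it runs in a Master-Worker model, only 3 of the 4 processes are actually computing.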

Now let's graph the results from 16 Light.S1 machines and 16 HighCPU.M4 machines.

f:id:fujish:20180109010640p:plain
f:id:fujish:20180109010821p:plain

Basically you can see the score rising with the number of nodes (processes).
A few points stand out.

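1) N=15 scales poorly with the number of processes
Light.S1 and HighCPU.M4 show the same tendency, and it is because the problem size is too small.
N has to be reasonably large, but if it is too large, such as N=24, the run cannot finish in a realistic time.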

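2) Light.S1 is unusually fast for N=16 at np 4 and above and for N=17 at np 16
When the processing time is under 2 to 3 seconds, the run finishes before clock capping kicks in, so it is fast. The values are close to those of HighCPU.M4.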

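So the effect of clock capping showed up here too. On the other hand, because of the way the work is parallelized, it scales cleanly, unlike the Himeno benchmark in the previous posts.
Next time we finally take on HPL, the standard supercomputer benchmark.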

fujish.hateblo.jp

2018-01-06

13. Let's Build an MPI Cluster! - Running the Himeno Benchmark a Bit More

This continues from the previous post.
12. Let's Build an MPI Cluster! - Getting the Himeno Benchmark to Run for Real - 電子計算記

The Himeno benchmark is tricky.
Last time we used the C + MPI, static allocate version.
This time let's try Fortran90 + MPI.

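With the installation steps so far, no Fortran runtime or build environment has been installed, so we start there. Either install it on all nodes, or install it on one machine and clone from a template.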

root@compute-1:~# apt install gfortran -y

Fetch and extract it the same way as for the C version.

mpiuser@compute-1:~$ wget http://accc.riken.jp/wp-content/uploads/2015/07/f90_xp_mpi.zip
mpiuser@compute-1:~$ unzip f90_xp_mpi.zip 
mpiuser@compute-1:~$ lha xw=/nfs/himeno-f90 f90_xp_mpi.lzh

For compilation, following the C version, we only add -O3 optimization.

mpiuser@compute-1:~$ cd /nfs/himeno-f90/
mpiuser@compute-1:/nfs/himeno-f90$ mpif90 -O3 himenoBMTxpr.f90

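Running it is also the same as for the C version. However, this is not the static allocate version, so the parameters are entered after launch. This example uses size M with 4 parallel processes.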

mpiuser@compute-1:/nfs/himeno-f90$ mpirun -np 4 --hostfile ~/my_hosts /nfs/himeno-f90/a.out 
 For example:
 Grid-size= 
            XS  (64x32x32)
            S   (128x64x64)
            M   (256x128x128)
            L   (512x256x256)
            XL  (1024x512x512)
  Grid-size = 
M

 For example: 
 DDM pattern= 
      1 1 2
      i-direction partitioning : 1
      j-direction partitioning : 1
      k-direction partitioning : 2
  DDM pattern = 
1 2 2

 Sequential version array size
  mimax=         257  mjmax=         129  mkmax=         129
 Parallel version  array size
  mimax=         257  mjmax=          66  mkmax=          66
  imax=         256  jmax=          65  kmax=          65
  I-decomp=            1  J-decomp=            2  K-decomp=            2

  Start rehearsal measurement process.
  Measure the performance in 3 times.
   MFLOPS:   9023.0592584404149        time(s):   4.5584917068481445E-002   1.70304556E-03
 Now, start the actual measurement process.
 The loop will be excuted in        3948  times.
 This will take about one minute.
 Wait for a while.
  Loop executed for         3948  times
  Gosa :   2.27055614E-04
  MFLOPS:   7828.2245009397411        time(s):   69.146085023880005     
  Score based on Pentium III 600MHz :   94.4981308

Here is a graph of the results on 16 HighCPU.M4 machines, as last time.

f:id:fujish:20180105234752p:plain

The tendency is the same as with the C version last time.
Comparing the results for size L:

f:id:fujish:20180105235024p:plain

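The scores differ and C is faster, but the tendency is the same.
(I have not investigated enough to say which language is faster or slower, so please bear with me.)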

Also, when run on Light.S1, the measurement time again becomes long, just as with the C version.

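So changing the implementation language did not change the tendency: the Himeno benchmark is a poor match for machine types with clock capping. Next time let's try a somewhat easier, or rather simpler, benchmark.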

fujish.hateblo.jp

2018-01-05

12. Let's Build an MPI Cluster! - Getting the Himeno Benchmark to Run for Real

This continues from the previous post.
11. Let's Build an MPI Cluster! - Running the Himeno Benchmark - 電子計算記

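I started the Himeno benchmark thinking it would be easy, but it turns out to be fairly tricky.
When the CPU clock is variable, the measurement time becomes very long and the measured values become questionable. We take two approaches to this problem.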

1) Adjust the measurement time
2) Run on a machine type less affected by clock capping

1) Adjust the measurement time

The iteration count (nn) is determined on line 133 of himenoBMTxps.c:
nn= (int)(target/(cpu/3.0));
Here cpu is the time in seconds taken by 3 iterations.
target is set to 60.0 on line 80, so the run takes roughly 1 minute.

The physical clock is 2.6 GHz and capping brings it down to 800 MHz, roughly a 3x difference, so we rewrite target to 20.0 and run.
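The effect of target on the iteration count can be checked with a few lines of Python that mirror the C expression quoted above. The function name is hypothetical, and the rehearsal time of 0.061022 s is the value printed in the 2-process log below.

```python
def himeno_loop_count(target_s: float, rehearsal_s: float,
                      rehearsal_iters: int = 3) -> int:
    """Mirror of nn = (int)(target / (cpu / 3.0)) from himenoBMTxps.c."""
    return int(target_s / (rehearsal_s / rehearsal_iters))


if __name__ == "__main__":
    rehearsal = 0.061022  # seconds for 3 rehearsal iterations
    for target in (60.0, 20.0):
        print(f"target={target}: nn={himeno_loop_count(target, rehearsal)}")
```

Because nn is derived from a rehearsal that runs before capping sets in, the real measurement can take far longer than target seconds once the clock drops.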

Running with 2 processes:

mpiuser@compute-1:/nfs/himeno$ time mpirun -np 2 --hostfile ~/my_hosts ./bmt 
Sequential version array size
 mimax = 129 mjmax = 129 mkmax = 257
Parallel version array size
 mimax = 129 mjmax = 129 mkmax = 131
imax = 128 jmax = 128 kmax =129
I-decomp = 1 J-decomp = 1 K-decomp =2
 Start rehearsal measurement process.
 Measure the performance in 3 times.

 MFLOPS: 6740.466122 time(s): 0.061022 1.667103e-03

 Now, start the actual measurement process.
 The loop will be excuted in 2949 times
 This will take about one minute.
 Wait for a while

cpu : 66.196050 sec.
Loop executed for 2949 times
Gosa : 3.322949e-04 
MFLOPS measured : 6107.963338
Score based on Pentium III 600MHz : 73.732054

real	1m7.043s
user	2m12.480s
sys	0m0.112s

Next, running with 4 processes:

mpiuser@compute-1:/nfs/himeno$ time mpirun -np 4 --hostfile ~/my_hosts ./bmt 
Sequential version array size
 mimax = 129 mjmax = 129 mkmax = 257
Parallel version array size
 mimax = 129 mjmax = 67 mkmax = 131
imax = 128 jmax = 65 kmax =129
I-decomp = 1 J-decomp = 2 K-decomp =2
 Start rehearsal measurement process.
 Measure the performance in 3 times.

 MFLOPS: 10178.064077 time(s): 0.040412 1.702009e-03

 Now, start the actual measurement process.
 The loop will be excuted in 4454 times
 This will take about one minute.
 Wait for a while

cpu : 77.765181 sec.
Loop executed for 4454 times
Gosa : 1.873346e-04 
MFLOPS measured : 7852.695350
Score based on Pentium III 600MHz : 94.793522

real	1m18.557s
user	2m10.800s
sys	0m24.620s

So the measurement finished in a realistic time.
The scores improved as well. To summarize:

2 processes

          target=60.0   target=20.0
MFLOPS    550           6107
real      10min         1min

4 processes

          target=60.0   target=20.0
MFLOPS    116           7852
real      10min         1min

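The scores went up and at first glance look correct, but the fact that the score changes with the run time, or rather the iteration count, means these results cannot be compared with others measured at target=60.0.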

2) Run on a machine type less affected by clock capping

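So to measure without rewriting the code, the only option is an environment with no clock capping or almost no effect from it. The HighCPU types have a 2.6 GHz physical CPU and are also specified at 2.6 GHz, so they look suitable.
We therefore resized from Light.S1 to HighCPU.M4 (2 vCPUs, 4 GB memory) and measured.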

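The measurements then went through without problems, so I graphed each size with varying process counts. I prepared up to 16 machines, so at -np 32 two processes run on each machine (these machines have 2 vCPUs).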

f:id:fujish:20180105231344p:plain

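The score tends to drop as the parallelism increases, and the smaller the problem size, the more pronounced the effect. A smaller problem size means less computation, so presumably the share of MPI communication overhead keeps growing.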
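One way to picture this is the ratio of boundary cells to interior cells in each process's subdomain. The sketch below is a deliberately simplified model, not Himeno's actual communication code: it assumes a regular 3D block decomposition, two neighbours along every split axis, and communication volume proportional to the face area. Grid sizes are the S and M sizes of the benchmark.

```python
def halo_to_volume_ratio(grid: tuple[int, int, int],
                         decomp: tuple[int, int, int]) -> float:
    """Boundary cells exchanged per interior cell for one subdomain.

    Simplified model: each axis split into more than one part contributes
    two faces, and a face along an axis of local length L adds 1/L per face.
    """
    ratio = 0.0
    for size, parts in zip(grid, decomp):
        if parts > 1:
            local_length = size / parts
            ratio += 2.0 / local_length
    return ratio


if __name__ == "__main__":
    sizes = {"S": (128, 64, 64), "M": (256, 128, 128)}
    for name, grid in sizes.items():
        for decomp in [(1, 2, 2), (2, 4, 4)]:
            print(name, decomp, halo_to_volume_ratio(grid, decomp))
```

In this model the ratio rises both when the grid shrinks and when the number of partitions grows, which is in line with the tendency seen in the graph.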

Incidentally, size XL could not be measured with 4 or fewer processes because of insufficient memory.
Also, compiling for size XL as-is produces a relocation truncated error like the following.

mpiuser@compute-1:/nfs/himeno$ make clean; make
/bin/rm -f bmt himenoBMTxps.o core
mpicc -c -O3 himenoBMTxps.c
himenoBMTxps.c: In function ‘initcomm’:
himenoBMTxps.c:292:5: warning: implicit declaration of function ‘exit’ [-Wimplicit-function-declaration]
     exit(0);
     ^
himenoBMTxps.c:292:5: warning: incompatible implicit declaration of built-in function ‘exit’
himenoBMTxps.c:292:5: note: include ‘<stdlib.h>’ or provide a declaration of ‘exit’
mpicc -o bmt himenoBMTxps.o -O3
himenoBMTxps.o: In function `initmt':
himenoBMTxps.c:(.text+0x93): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0x9a): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xa8): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xb3): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xd5): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xdc): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xe3): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0xee): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0x10c): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0x113): relocation truncated to fit: R_X86_64_32S against `.bss'
himenoBMTxps.c:(.text+0x11a): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status
Makefile:13: recipe for target 'bmt' failed
make: *** [bmt] Error 1

In that case, add -mcmodel=large on line 7 of the Makefile and run make again.
CFLAGS = -O3 -mcmodel=large
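
A plausible cause of this error is the size of the statically allocated arrays: R_X86_64_32S is a signed 32-bit relocation, so under the default small code model static data has to fit within 2 GiB. The sketch below estimates the array footprint. The count of 14 grid-sized float arrays and the XL dimensions of 513x513x1025 are assumptions (the latter extrapolated from the M dimensions 129x129x257 shown in the logs), not values read from this build.

```python
FLOAT_BYTES = 4   # assumption: the C version stores single-precision floats
NUM_ARRAYS = 14   # assumption: number of grid-sized static arrays
LIMIT = 2**31     # 2 GiB reachable with a signed 32-bit relocation


def static_bytes(mimax, mjmax, mkmax):
    """Estimated bytes of static arrays for the given array dimensions."""
    return mimax * mjmax * mkmax * FLOAT_BYTES * NUM_ARRAYS


for name, dims in (("M", (129, 129, 257)), ("XL", (513, 513, 1025))):
    size = static_bytes(*dims)
    print(f"{name}: {size / 2**30:.2f} GiB, exceeds 2 GiB: {size > LIMIT}")
```

With several processes the per-process arrays are smaller than the full grid, so whether the limit is crossed depends on the decomposition passed to paramset.sh.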

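So it turns out that the Himeno benchmark is a poor match for IDCF Cloud machine types that are subject to clock capping. Is that really true? I am a little worried, so I will check some more. This time it was the C code, so next time let's also try the Fortran implementation of the Himeno benchmark.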

fujish.hateblo.jp

2018-01-04

11. Let's build an MPI cluster! - Running the Himeno benchmark

Continuing from the previous post.
10. Let's build an MPI cluster! - NFS client setup - 電子計算記

With the NFS storage server ready, the MPI cluster environment is now built to the point where it minimally works, so let's measure the cluster's performance with a real benchmark.

The first is the Himeno benchmark, a standard choice for HPC-oriented benchmarking.

Himeno Benchmark | RIKEN Advanced Center for Computing and Communication (for internal and external users)

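I will leave what the Himeno benchmark is to the linked page, but one of its good points is that official implementations exist in many variants: C and Fortran, MPI and OpenMP, and so on.
So let's download the MPI version right away, compile it, and run it under MPI.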

This time I use the C + MPI, static allocate version.
The code is written in C and parallelized with MPI. "Static allocate" means that the problem size and the way the domain is partitioned are specified before compiling.

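First, install the packages needed for the build that have not been installed so far. (Doing this only on compute-1 is enough.)
The Himeno benchmark is compressed in lzh format, so these are the tool for extracting it and the command used for building.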

root@compute-1:~# apt install lhasa -y
root@compute-1:~# apt install make -y

Now download and extract the Himeno benchmark. Note that extracting the zip yields an lzh file. (An upload mistake, maybe?) The extraction destination is on the NFS storage.

mpiuser@compute-1:~$ wget http://accc.riken.jp/wp-content/uploads/2015/07/cc_himenobmtxp_mpi.zip
mpiuser@compute-1:~$ unzip cc_himenobmtxp_mpi.zip
mpiuser@compute-1:~$ lha xw=/nfs/himeno cc_himenobmtxp_mpi.lzh
/nfs/himeno/himenoBMTxps.c	- Melted   :  oo
/nfs/himeno/Makefile.sample	- Melted   :  o
/nfs/himeno/param.h	- Melted   :  o
/nfs/himeno/paramset.sh	- Melted   :  o
mpiuser@compute-1:~$ cd /nfs/himeno/
mpiuser@compute-1:/nfs/himeno$ ls -lh
total 28K
-rw------- 1 mpiuser mpiuser  13K Sep 29  2005 himenoBMTxps.c
-rw------- 1 mpiuser mpiuser  251 Feb 21  2002 Makefile.sample
-rw------- 1 mpiuser mpiuser  202 Feb 21  2002 param.h
-rw------- 1 mpiuser mpiuser 2.1K Feb 21  2002 paramset.sh

As for the Makefile, the sample can be used as-is this time.

mpiuser@compute-1:/nfs/himeno$ cp Makefile.sample Makefile

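Next, set the parameters before running make. paramset.sh is provided as the configuration tool.
To use it, give the problem size first (XS, S, M, L, XL, XXL), then the number of partitions along each of the three dimensions.
The example here uses problem size M split across 2 processes. Run make after setting the parameters.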

mpiuser@compute-1:/nfs/himeno$ chmod +x ./paramset.sh
mpiuser@compute-1:/nfs/himeno$ ./paramset.sh M 1 1 2
mpiuser@compute-1:/nfs/himeno$ make

Running works as before, but the number of processes (-np) must match the parameter settings.
In this case (./paramset.sh M 1 1 2), 1*1*2=2, so -np 2.
For -np 4, for example, it would be 1*2*2=4. (./paramset.sh M 1 2 2)
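
The rule above can be written down as a small helper. This is just an illustrative sketch: the function name himeno_commands is made up, and the paths and hostfile are the ones used in this series.

```python
def himeno_commands(size, i_decomp, j_decomp, k_decomp, hostfile="~/my_hosts"):
    """Return the paramset.sh and mpirun command lines for one decomposition.

    The process count passed to -np is the product of the three partition
    counts, so the two commands always stay consistent.
    """
    nprocs = i_decomp * j_decomp * k_decomp
    paramset = f"./paramset.sh {size} {i_decomp} {j_decomp} {k_decomp}"
    mpirun = f"mpirun -np {nprocs} --hostfile {hostfile} /nfs/himeno/bmt"
    return paramset, mpirun


for decomp in ((1, 1, 2), (1, 2, 2)):
    paramset, mpirun = himeno_commands("M", *decomp)
    print(paramset)
    print(mpirun)
```

Remember that make has to be run again after every paramset.sh call, because the decomposition is compiled in.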

mpiuser@compute-1:/nfs/himeno$ mpirun -np 2 --hostfile ~/my_hosts /nfs/himeno/bmt 
Sequential version array size
 mimax = 129 mjmax = 129 mkmax = 257
Parallel version array size
 mimax = 129 mjmax = 129 mkmax = 131
imax = 128 jmax = 128 kmax =129
I-decomp = 1 J-decomp = 1 K-decomp =2
 Start rehearsal measurement process.
 Measure the performance in 3 times.

 MFLOPS: 5504.816162 time(s): 0.074719 1.667103e-03

 Now, start the actual measurement process.
 The loop will be excuted in 2409 times
 This will take about one minute.
 Wait for a while

cpu : 600.021093 sec.
Loop executed for 2409 times
Gosa : 4.109818e-04 
MFLOPS measured : 550.457770
Score based on Pentium III 600MHz : 6.644831

The final result is the MFLOPS measured value, so this run scored 550.457770 MFLOPS.
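
To see where that number comes from, the MFLOPS value can be recomputed from the loop count and the cpu time in the log. This is a sketch under an assumption: it takes 34 floating-point operations per interior grid point, a figure that is not stated in this post but that agrees with the logged values to within rounding for the M grid (128x128x256).

```python
FLOPS_PER_POINT = 34  # assumption: operations per interior grid point


def mflops(loops, cpu_seconds, grid=(128, 128, 256)):
    """Recompute MFLOPS from the loop count and elapsed time of a run."""
    nx, ny, nz = grid
    # Only interior points are updated, hence the -2 in each dimension.
    flop_per_loop = (nx - 2) * (ny - 2) * (nz - 2) * FLOPS_PER_POINT
    return flop_per_loop * loops / cpu_seconds / 1e6


# Loop count and cpu time taken from the 2-process log above.
print(mflops(2409, 600.021093))
```

The same function applied to the 3-loop rehearsal (0.074719 s) gives the much higher rehearsal figure, which is the gap discussed next.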

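If you have run the Himeno benchmark in another environment, this run probably felt like it took unusually long to finish.
This comes from how the Himeno benchmark is implemented, combined with how IDCF Cloud's Light.S1 behaves.
The Himeno benchmark first runs 3 iterations, then uses that result to choose an iteration count that is large enough yet gives a reasonable run time (1 minute x number of processes), and performs the final measurement with it.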
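
The loop counts in the logs can be reproduced with a simple estimate. This is a sketch, not the benchmark's actual code: it assumes the count is a fixed target time divided by the per-loop time of the rehearsal, and the 60-second target is inferred here only because it matches the logged counts (2409 for the 2-process run above).

```python
def planned_loops(rehearsal_seconds, rehearsal_loops=3, target_seconds=60.0):
    """Estimate how many loops the main measurement will run.

    Assumption: the benchmark divides a fixed target time by the per-loop
    time observed in the rehearsal and truncates to an integer.
    """
    per_loop = rehearsal_seconds / rehearsal_loops
    return int(target_seconds / per_loop)


# Rehearsal time taken from the 2-process log above.
print(planned_loops(0.074719))
```

The key point is that the count is fixed from the rehearsal alone, so anything that makes the rehearsal unrepresentative distorts the whole run.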

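IDCF Cloud's Light.S1, on the other hand, is specified at 800MHz, but that figure comes from clock capping: under sustained load it only reaches 800MHz, while for short periods it bursts above 800MHz and delivers the physical CPU's full performance.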

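As a result, the iteration count is decided at the physical CPU's performance, but once the actual computation starts it runs at the nominal 800MHz and takes many times longer than intended, and the time grows further as nodes are added.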

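The 2-process run above took about 10 minutes (it should have been about 2 minutes). With 4 processes it took over 100 minutes.
What is more, as shown below, the result is slower with 4 processes (4 nodes) than with 2 processes (2 nodes).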

mpiuser@compute-1:/nfs/himeno$ mpirun -np 4 --hostfile ~/my_hosts /nfs/himeno/bmt 
Sequential version array size
 mimax = 129 mjmax = 129 mkmax = 257
Parallel version array size
 mimax = 129 mjmax = 67 mkmax = 131
imax = 128 jmax = 65 kmax =129
I-decomp = 1 J-decomp = 2 K-decomp =2
 Start rehearsal measurement process.
 Measure the performance in 3 times.

 MFLOPS: 11850.404321 time(s): 0.034709 1.702009e-03

 Now, start the actual measurement process.
 The loop will be excuted in 5185 times
 This will take about one minute.
 Wait for a while

cpu : 6080.535240 sec.
Loop executed for 5185 times
Gosa : 1.423682e-04 
MFLOPS measured : 116.912427
Score based on Pentium III 600MHz : 1.411304
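
To put a number on the effect described above, the per-loop time of the rehearsal can be compared with the per-loop time of the main measurement. The sketch below uses only values from the two logs in this post; the function name slowdown is made up for illustration.

```python
def slowdown(rehearsal_seconds, measured_seconds, measured_loops, rehearsal_loops=3):
    """Ratio of per-loop time in the main run to per-loop time in the rehearsal."""
    burst = rehearsal_seconds / rehearsal_loops
    sustained = measured_seconds / measured_loops
    return sustained / burst


# (rehearsal time, cpu time, loop count) taken from the logs in this post.
runs = {
    "2 processes": (0.074719, 600.021093, 2409),
    "4 processes": (0.034709, 6080.535240, 5185),
}
for label, (rehearsal, cpu, loops) in runs.items():
    print(f"{label}: {slowdown(rehearsal, cpu, loops):.1f}x slower per loop than in the rehearsal")
```

These ratios correspond to the gap between the rehearsal MFLOPS and the measured MFLOPS in each log.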

Next time, let's look at this in detail: is the longer measurement time the problem, or is clock capping itself to blame?

fujish.hateblo.jp