MPI Collective Communication (Part 1)

Collective communication transfers data among all processes in a communicator specified by MPI. MPI provides the following kinds of collective communication:
1) Synchronization of all processes
2) Global communication functions, including Broadcast, Gather, Scatter, and global reduction operations (sum, maximum, minimum, and so on). The global communication functions essentially follow three communication patterns:
a) The root process sends data to all processes (including itself), e.g. broadcast and scatter
b) The root process receives data from all processes (including itself), e.g. gather
c) Every process communicates with every process (including itself), e.g. allgather and alltoall.
The syntax of the MPI collective communication functions was designed to be consistent with point-to-point communication. However, to simplify their use, the collective functions are more restrictive than the point-to-point ones. For example, unlike point-to-point communication, the amount of data sent in a collective operation must exactly match the amount specified by the receiver. Another simplification is that the collective functions use blocking communication. Collective functions take no tag argument, so within a communicator collective calls are matched strictly according to the order in which they are executed.
The last simplification is that the collective functions come in a single mode, which can be compared to the standard mode of point-to-point communication. More specifically, a collective call can return as soon as the caller's participation in the collective communication is complete. As usual, completion means that the caller is free to modify the contents of its communication buffer; it does not mean that other processes have completed, or even started, the corresponding operation.
1. The blocking synchronization function MPI_BARRIER
MPI_BARRIER blocks the caller until all processes in the communicator have called it. In other words, the call returns at any given process only after every other process has also entered the call.
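A minimal, self-contained usage sketch (not from the original text; MPI_COMM_WORLD and the printed messages are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("process %d reached the barrier\n", rank);
    /* no process continues past this call until every process has entered it */
    MPI_Barrier(MPI_COMM_WORLD);
    printf("process %d passed the barrier\n", rank);
    MPI_Finalize();
    return 0;
}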
2. The broadcast function MPI_BCAST
MPI_BCAST broadcasts a message from the root process to all other processes. When the function returns, the contents of the root's buffer have been copied to every process. Derived datatypes may also be used with this function. Example:

MPI_Bcast(
void *buffer //starting address of buffer
int count //number of entries in buffer
MPI_Datatype datatype //data type of buffer
int root //rank of broadcast root
MPI_Comm comm) //group communicator

[Figure: MPI_Bcast copies the root's buffer to all processes]

MPI_Comm comm;
int array[100];
int root = 0;
......
MPI_Bcast(array,100,MPI_INT,root,comm);

For example, say process 0 is the root. For process 0, MPI_Bcast behaves like a send, with buffer being the send buffer. For every other process, MPI_Bcast behaves like a receive, with buffer being the receive buffer. The code to use MPI_Bcast therefore looks something like this:
...
if (rank == 0)      /* process 0 is the root */
{
    some_value = something;
}

/* at this point, only process 0 has some_value */

MPI_Bcast(&some_value, ..., 0, ...);

/* at this point every process has some_value */
...
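Putting the pieces together, a complete program might look like the sketch below (an illustrative assumption of this text, not the original author's code; the value 42 and MPI_COMM_WORLD are arbitrary choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, some_value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        some_value = 42;    /* before the broadcast, only the root knows the value */
    }
    MPI_Bcast(&some_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d now has some_value = %d\n", rank, some_value);
    MPI_Finalize();
    return 0;
}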

3. The gather function MPI_Gather
Each process (including the root) sends the contents of its send buffer to the root process. The root receives the messages and stores them in rank order. The outcome is as if each process executed MPI_Send(sendbuf,sendcount,sendtype,root,tag,comm) and the root executed n receives MPI_Recv(recvbuf+i*recvcount*extent(recvtype),recvcount,recvtype,i,tag,comm).
As usual, derived datatypes can also be used with the gather function:

MPI_Gather(
void *sendbuf //starting address of send buffer
int sendcount //number of elements in send buffer
MPI_Datatype sendtype //data type of send buffer elements
void *recvbuf //starting address of receive buffer
int recvcount //number of elements for any single receive
MPI_Datatype recvtype //data type of recv buffer elements
int root //rank of receiving process
MPI_Comm comm) //group communicator

See the image below for details:
[Figure: MPI_Gather collects the data from all processes at the root]

Gather 100 ints from every process in group to root:
MPI_Comm comm;
int gsize,sendarray[100];
int root, * rbuf;
.....
MPI_Comm_size(comm,&gsize);
rbuf = (int*)malloc(gsize*100*sizeof(int));
MPI_Gather(sendarray,100,MPI_INT,rbuf,100,MPI_INT,root,comm);

Alternatively, the receive buffer can be allocated on the root process only:

MPI_Comm comm;
int gsize,sendarray[100];
int root, *rbuf, myrank;
.....
MPI_Comm_rank(comm,&myrank);
if(myrank == root){
MPI_Comm_size(comm,&gsize);
rbuf = (int*)malloc(gsize*100*sizeof(int));
}
MPI_Gather(sendarray,100,MPI_INT,rbuf,100,MPI_INT,root,comm);

The same can also be done with a derived datatype on the receive side:

MPI_Comm comm;
int gsize,sendarray[100];
int root,*rbuf;
MPI_Datatype rtype;
..
MPI_Comm_size(comm,&gsize);
MPI_Type_contiguous(100,MPI_INT,&rtype);
MPI_Type_commit(&rtype);
rbuf=(int*)malloc(gsize*100*sizeof(int));
MPI_Gather(sendarray,100,MPI_INT,rbuf,1,rtype,root,comm);

4) MPI_Gatherv
MPI_Gatherv extends the functionality of MPI_Gather. With MPI_Gatherv a different amount of data can be received from each process, because recvcounts is now an array rather than a single integer. The function can also specify, through a new displacement argument, where the incoming data is placed on the root process.
The "v" can be read as "variable" or "varying", since this family of functions allows both the message lengths and the locations of the messages in memory to vary.

MPI_Gatherv(
const void *sbuf //starting address of send buffer
int scount //number of elements in send buffer
MPI_Datatype stype //data type of send buffer elements
void *rbuf //starting address of receive buffer
const int rcounts[] //array containing the number of elements to be received from each process
const int displs[] //array specifying the displacement relative to rbuf at which to place the incoming data from the corresponding process
MPI_Datatype rtype //data type of receive buffer elements
int root //rank of receiving process
MPI_Comm comm) //group communicator

Note: rbuf, rcounts, displs, and rtype are significant for the root process only.

Example: every process sends 100 ints to the root, and the root places them at a specified stride. Assume stride >= 100.

MPI_Comm comm;
int gsize,sendarray[100];
int root,*rbuf,stride;
int *displs,i,*rcounts;
.........
MPI_Comm_size(comm,&gsize);
rbuf = (int*)malloc(gsize*stride*sizeof(int));
displs = (int*)malloc(gsize*sizeof(int));
rcounts = (int*)malloc(gsize*sizeof(int));
for(i=0;i<gsize;++i){
displs[i] = i*stride; //the displacement of block i; stride equals the number of elements received per process plus the gap between consecutive blocks
rcounts[i] = 100;
}

MPI_Gatherv(sendarray,100,MPI_INT,rbuf,rcounts,displs,MPI_INT,root,comm);

Note that the program is erroneous if stride < 100, since the 100-int blocks placed at the root would then overlap.

5 MPI_Scatter

int MPI_Scatter(
const void* sbuf, //the address of send buffer
int scount, //the number of elements to be sent in each process
MPI_Datatype stype, //the data type of send buffer elements
void* rbuf, //the address of receive buffer
int rcount, //the number of elements in the receive buffer
MPI_Datatype rtype, //datatype of the receive buffer elements
int root, //rank of the sending (root) process
MPI_Comm comm //name of communicator
)

Note: sbuf, scount, and stype are significant for the root process only.
MPI_Scatter is the inverse operation of MPI_Gather. The outcome is as if the root executed n send operations and each process executed one MPI_Recv() operation.
Alternatively, MPI_Scatter can be viewed as the root sending one message with MPI_Send; this message is split into n equal segments, the i-th segment is delivered to the i-th process, and each process receives its own segment.

Example: scatter 100 ints from the root to each process in the group:

MPI_Comm comm;
int gsize,*sendbuf;
int root,rbuf[100];
...
MPI_Comm_size(comm,&gsize);
sendbuf = (int*)malloc(gsize*100*sizeof(int)); //allocate a send buffer large enough for the whole group
MPI_Scatter(sendbuf,100,MPI_INT,rbuf,100,MPI_INT,root,comm);
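For completeness, a self-contained sketch (not from the original; the values and MPI_COMM_WORLD are illustrative) in which the root fills the send buffer before scattering:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int gsize, rank, i, root = 0;
    int *sendbuf = NULL;
    int rbuf[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == root) {
        /* only the root needs the full send buffer */
        sendbuf = (int*)malloc(gsize * 100 * sizeof(int));
        for (i = 0; i < gsize * 100; ++i)
            sendbuf[i] = i;             /* elements i*100 .. i*100+99 go to process i */
    }

    MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, MPI_COMM_WORLD);

    printf("process %d received a block starting with %d\n", rank, rbuf[0]);

    if (rank == root) free(sendbuf);
    MPI_Finalize();
    return 0;
}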

6 MPI_Scatterv

int MPI_Scatterv(
const void* sbuf, //address of send buffer
const int scounts[], //integer array specifying the number of elements to send to each process,
const int displs[], //array specifying the displacement relative to sbuf from which to take the outgoing data for the corresponding process,
MPI_Datatype stype, //data type of send buffer element
void* rbuf, //address of receive buffer
int rcount,//number of elements in receive buffer
MPI_Datatype rtype, //data type of receive buffer elements
int root, //rank of sending process
MPI_Comm comm //group communicator
)

MPI_Scatterv is the inverse operation of MPI_Gatherv. It extends MPI_Scatter so that a different amount of data can be sent to each process. Through the new displs argument, the function also specifies where in the root's send buffer the data for each destination is taken from.

Suppose the root now scatters 100 ints to each process, but the 100-int blocks are laid out in the root's send buffer with a stride of stride ints (stride > 100). MPI_Scatterv is needed:

MPI_Comm comm;
int gsize,*sendbuf;
int root,rbuf[100],i,*displs,*scounts,stride;
.....
MPI_Comm_size(comm,&gsize);
sendbuf = (int*)malloc(gsize*stride*sizeof(int));
....
displs = (int*)malloc(gsize*sizeof(int));
scounts = (int*)malloc(gsize*sizeof(int));
for(i=0;i<gsize;++i){
displs[i] = i*stride;
scounts[i] = 100;
}
MPI_Scatterv(sendbuf,scounts,displs,MPI_INT,rbuf,100,MPI_INT,root,comm);

7 MPI_Allgather

int MPI_Allgather(
const void* sbuf, //starting address of send buffer
int scount,//number of elements to send to each process
MPI_Datatype stype,//datatype of send buffer elements
void* rbuf,//address of receive buffer
int rcount, //number of elements to receive from each process
MPI_Datatype rtype,//datatype of receive buffer elements
MPI_Comm comm)//group communicator

MPI_Allgather is very similar to MPI_Gather; the difference is that with MPI_Allgather all processes receive the result, rather than only the root as in MPI_Gather.
[Figure: illustration of MPI_Allgather]
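The original gives no code example for MPI_Allgather, so the sketch below is an illustrative addition (MPI_COMM_WORLD assumed): every process contributes its own rank, and afterwards every process holds the complete list of ranks.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int gsize, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* unlike MPI_Gather, every process allocates the full receive buffer */
    int *allranks = (int*)malloc(gsize * sizeof(int));
    MPI_Allgather(&rank, 1, MPI_INT, allranks, 1, MPI_INT, MPI_COMM_WORLD);

    /* every process now holds 0, 1, ..., gsize-1 */
    printf("process %d: last gathered rank = %d\n", rank, allranks[gsize - 1]);

    free(allranks);
    MPI_Finalize();
    return 0;
}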

8 MPI_Allgatherv

MPI_Allgatherv(
const void *sbuf //starting address of send buffer
int scount //number of elements to send to each process
MPI_Datatype stype //data type of send buffer elements
void *rbuf //starting address of receive buffer
const int rcounts[] //number of elements received from each process
const int displs[] //displacement of each process's data relative to rbuf
MPI_Datatype rtype //data type of receive buffer elements
MPI_Comm comm) //group communicator

MPI_Allgatherv relates to MPI_Allgather in the same way that MPI_Gatherv relates to MPI_Gather: all processes receive the result, and each process may contribute a different number of elements, placed according to displs.
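Again there is no example in the original, so here is an illustrative sketch (MPI_COMM_WORLD and the chosen counts are assumptions) in which process i contributes i+1 integers:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int gsize, rank, i, total = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int scount = rank + 1;                      /* process i contributes i+1 elements */
    int *sbuf = (int*)malloc(scount * sizeof(int));
    for (i = 0; i < scount; ++i) sbuf[i] = rank;

    int *rcounts = (int*)malloc(gsize * sizeof(int));
    int *displs  = (int*)malloc(gsize * sizeof(int));
    for (i = 0; i < gsize; ++i) {
        rcounts[i] = i + 1;                     /* how much each process sends */
        displs[i]  = total;                     /* where its block starts in rbuf */
        total += rcounts[i];
    }
    int *rbuf = (int*)malloc(total * sizeof(int));

    MPI_Allgatherv(sbuf, scount, MPI_INT, rbuf, rcounts, displs, MPI_INT, MPI_COMM_WORLD);

    printf("process %d received %d elements in total\n", rank, total);

    free(sbuf); free(rcounts); free(displs); free(rbuf);
    MPI_Finalize();
    return 0;
}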

9 MPI_Alltoall

MPI_Alltoall(
void * sbuf //starting address of send buffer
int scount //number of elements sent to each process
MPI_Datatype stype //data type of send buffer elements
void * rbuf //address of receive buffer
int rcount //number of elements received from any process
MPI_Datatype rtype //data type of receive buffer elements
MPI_Comm comm) //group communicator

MPI_Alltoall is an extension of MPI_Allgather in which each process sends distinct data to each receiver. The j-th block sent from process i is received by process j and placed in the i-th block of its receive buffer.
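A minimal sketch (an illustrative assumption, not from the original): each process sends one integer to every process, encoded so that the source and destination are visible in the value.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int gsize, rank, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *sbuf = (int*)malloc(gsize * sizeof(int));
    int *rbuf = (int*)malloc(gsize * sizeof(int));
    for (i = 0; i < gsize; ++i)
        sbuf[i] = rank * 100 + i;   /* block i is destined for process i */

    MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD);

    /* rbuf[i] now holds what process i sent to this process: i*100 + rank */
    printf("process %d received %d from process 0\n", rank, rbuf[0]);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}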

10 MPI_Alltoallv

MPI_Alltoallv(
void * sbuf //starting address of send buffer
int scounts[] //number of elements sent to each process
int *sdispls //the displacements of data relative to sbuf
MPI_Datatype stype //data type of send buffer elements
void * rbuf //address of receive buffer
int rcounts[] //number of elements received from each process
int *rdispls //the displacements of the incoming data relative to rbuf
MPI_Datatype rtype //data type of receive buffer elements
MPI_Comm comm) //group communicator

Compared with MPI_Alltoall, MPI_Alltoallv adds two arrays that specify where the outgoing data is taken from and where the incoming data is stored. The j-th block of process i is received by process j and placed in the i-th block of its receive buffer. The blocks need not all have the same size.
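A sketch under the simplest possible assumptions (uniform block size of one, chosen here purely for illustration) showing how the four arrays are filled in:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int gsize, rank, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *sbuf    = (int*)malloc(gsize * sizeof(int));
    int *rbuf    = (int*)malloc(gsize * sizeof(int));
    int *scounts = (int*)malloc(gsize * sizeof(int));
    int *sdispls = (int*)malloc(gsize * sizeof(int));
    int *rcounts = (int*)malloc(gsize * sizeof(int));
    int *rdispls = (int*)malloc(gsize * sizeof(int));

    for (i = 0; i < gsize; ++i) {
        sbuf[i]    = rank;   /* value sent to process i */
        scounts[i] = 1;      /* one element goes to each destination ...     */
        rcounts[i] = 1;      /* ... and one element arrives from each source */
        sdispls[i] = i;      /* block i starts at offset i in sbuf           */
        rdispls[i] = i;      /* block from process i lands at offset i       */
    }

    MPI_Alltoallv(sbuf, scounts, sdispls, MPI_INT,
                  rbuf, rcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    free(sbuf); free(rbuf); free(scounts); free(sdispls); free(rcounts); free(rdispls);
    MPI_Finalize();
    return 0;
}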

Reduction functions

MPI has two kinds of functions that involve global computation: reduce and scan. Reduce returns the full result of applying an operation to a distributed sequence of data; scan returns "incremental results" (running totals). MPI provides four global computation functions.
1) MPI_Reduce

MPI_Reduce(
void *sendbuf //address of send buffer
void *recvbuf //address of receive buffer
int count //number of elements in send buffer
MPI_Datatype datatype //data type of elements of send buffer
MPI_Op op //reduce operation
int root //rank of root process
MPI_Comm comm) //group communicator

MPI_Reduce combines the data in the input buffer of every process in the communicator using the operation op and returns the combined result in the output buffer of the root process. The input buffer is defined by the three arguments sendbuf, count, and datatype, which specify its location, number of elements, and data type; the output buffer is defined by the analogous arguments recvbuf, count, and datatype. Both buffers therefore hold the same number of elements of the same type. The arguments count, op, and root must be identical on all processes, so every process provides input and output buffers of the same length and element type.
For op, MPI provides a number of predefined operations, but users can also define their own operations as needed and pass them to the reduction functions in place of the predefined ones.
List of Predefined reduce operations
1) MPI_MAX //return the maximum element
2) MPI_MIN //return the minimum element
3) MPI_SUM //sum the elements
4) MPI_PROD //multiple all elements
5) MPI_LAND //perform a logical and across all elements
6) MPI_BAND //perform a bitwise and across all elements
7) MPI_LOR //perform a logical or across all elements
8) MPI_BOR //perform a bitwise or across all elements
9) MPI_LXOR //logical xor
10) MPI_BXOR //bit wise xor
11) MPI_MAXLOC //return the maximum value together with the rank of the process that owns it
12) MPI_MINLOC //return the minimum value together with the rank of the process that owns it

The following picture shows the results using operation MPI_SUM:
[Figure: MPI_Reduce with the MPI_SUM operation]
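As a concrete counterpart to the picture, a minimal sketch (an illustrative assumption: each process contributes its own rank) that sums one value per process at the root:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, gsize, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    /* every process contributes its rank; only the root receives the sum */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", gsize - 1, sum);

    MPI_Finalize();
    return 0;
}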

A close relative of MPI_Reduce is MPI_Allreduce. Its behaviour is illustrated in the figure below:
[Figure: MPI_Allreduce]

MPI_Allreduce is equivalent to an MPI_Reduce followed by an MPI_Bcast.
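The same sum delivered to every process then takes a single call (again an illustrative sketch, not the original author's code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every process contributes its rank and every process receives the total */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d sees sum = %d\n", rank, sum);

    MPI_Finalize();
    return 0;
}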

A third variant of MPI_Reduce is MPI_Reduce_scatter.

MPI_Reduce_scatter(
void * sendbuf //starting address of send buffer
void * recvbuf //starting address of receive buffer
int *recvcounts //array specifying the number of reduced elements scattered back to each process
MPI_Datatype datatype //data type of elements of input buffer
MPI_Op op //operation
MPI_Comm comm) //group communicator

MPI_Reduce_scatter is functionally equivalent to an MPI_Reduce followed by an MPI_Scatterv.
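A minimal sketch (illustrative assumptions throughout): each process contributes a vector of gsize elements, the vectors are summed element-wise, and element i of the reduced vector is scattered back to process i.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int gsize, rank, i, result;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *sendbuf    = (int*)malloc(gsize * sizeof(int));
    int *recvcounts = (int*)malloc(gsize * sizeof(int));
    for (i = 0; i < gsize; ++i) {
        sendbuf[i]    = rank;  /* this process's contribution to element i */
        recvcounts[i] = 1;     /* each process gets one element of the reduced vector */
    }

    MPI_Reduce_scatter(sendbuf, &result, recvcounts, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* result is the sum of all ranks: 0 + 1 + ... + (gsize-1) */
    printf("process %d: result = %d\n", rank, result);

    free(sendbuf); free(recvcounts);
    MPI_Finalize();
    return 0;
}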

2) MPI_Scan

MPI_Scan(
void *sendbuf //starting address of send buffer
void *recvbuf //starting address of receive buffer
int count // number of elements in input buffer
MPI_Datatype datatype //data type of elements of input buffer
MPI_Op op //operation
MPI_Comm comm) //communicator

Suppose the operation used in the scan is MPI_SUM. "In computer science, the prefix sum, scan, or cumulative sum of a sequence of numbers x0, x1, x2, … is a second sequence of numbers y0, y1, y2, …, the sums of prefixes (running totals) of the input sequence:

y0 = x0
y1 = x0 + x1
y2 = x0 + x1 + x2"

So the scan here is simply a cumulative (prefix) sum.
The following figure illustrates MPI_Scan:
[Figure: MPI_Scan]
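A minimal sketch (illustrative, not from the original): scanning the ranks with MPI_SUM gives each process the sum of ranks 0 through its own.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, prefix = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* process k receives 0 + 1 + ... + k (the scan includes its own contribution) */
    MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}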

Defining a custom operation function:

MPI_Op myOp;

MPI_Op_create( MPI_User_function *function, int commute, MPI_Op *myOp );
function    //user defined function (function)
commute //true if commutative; false otherwise.
myOp    //operation (handle)

MPI_Op_create creates a user-defined combination function handle.

This new handle, the op variable, can now be used in calls to MPI_Reduce and MPI_Scan.

If commute = true, the operation must be both commutative and associative. If commute = false, the order of operands is fixed and is defined to be in ascending process-rank order, beginning with process zero. The order of evaluation may still be changed by taking advantage of the associativity of the operation.


Associative law:
(x ∗ y) ∗ z = x ∗ (y ∗ z) for all x, y, z

The term “commutative” is used in several related senses.[1][2]
A binary operation * on a set S is called commutative if:
x * y = y * x for all x,y in S
An operation that does not satisfy the above property is called non-commutative.
One says that x commutes with y under * if:
x * y = y * x


function is the user-defined function, which must have the following four arguments: invec, inoutvec, len, and datatype.
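To make the required signature concrete, here is a sketch of a user-defined operation that re-implements integer summation (the function name my_int_sum and its use with MPI_Reduce are illustrative assumptions, not part of the original text):

#include <mpi.h>
#include <stdio.h>

/* user-defined combination function: inoutvec[i] = invec[i] op inoutvec[i] */
void my_int_sum(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype) {
    int i;
    int *in    = (int*)invec;
    int *inout = (int*)inoutvec;
    for (i = 0; i < *len; ++i)
        inout[i] = in[i] + inout[i];
}

int main(int argc, char **argv) {
    int rank, sum = 0;
    MPI_Op myOp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op_create(my_int_sum, 1, &myOp);   /* commute = 1: the operation is commutative */

    MPI_Reduce(&rank, &sum, 1, MPI_INT, myOp, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %d\n", sum);

    MPI_Op_free(&myOp);
    MPI_Finalize();
    return 0;
}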

Example analysis
1) switch(rank){
case 0:
    MPI_Bcast(buf1, count, type, 0, comm);
    MPI_Send(buf2, count, type, 1, tag, comm);
    break;
case 1:
    MPI_Recv(buf2, count, type, 0, tag, comm, MPI_STATUS_IGNORE);
    MPI_Bcast(buf1, count, type, 0, comm);
    break;
}
Process 0 first executes the broadcast and then a blocking send. Process 1 first posts a blocking receive that matches the send on process 0, and then a broadcast that matches the broadcast on process 0. This program may deadlock: the broadcast on process 0 may block until process 1 executes the matching broadcast call, so the send on process 0 is never executed, while process 1 blocks forever in the receive and never reaches its broadcast.
Therefore, care must be taken when mixing point-to-point and collective communication. The relative order of the calls must be arranged so that no deadlock can occur even when both kinds of operations behave synchronously. A corrected ordering is sketched below.
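One way to remove the cyclic dependency (a suggestion of this text, not from the original) is to let both processes complete the collective call before the point-to-point exchange:

switch(rank){
case 0:
    MPI_Bcast(buf1, count, type, 0, comm);
    MPI_Send(buf2, count, type, 1, tag, comm);
    break;
case 1:
    MPI_Bcast(buf1, count, type, 0, comm);   /* match the broadcast on process 0 first */
    MPI_Recv(buf2, count, type, 0, tag, comm, MPI_STATUS_IGNORE);
    break;
}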
