Sending and Receiving with Halo Exchanges

From our examination of the illustrations for column-major and row-major languages, we conclude that we should split our two-dimensional grid by columns for Fortran and similar languages, and by rows for C++, Python, and others with that array orientation.

Let us assume for both cases that we have nr rows and nc columns in our grid. In a one-dimensional decomposition, we split only one of those values among the processes. Call the local number of rows and columns nrl and ncl. When boundary rows and columns (“ghost” or real) are added, each subdomain will contain nrl+2 rows and ncl+2 columns. Our computed domains will extend from $1$ to nrl and $1$ to ncl, with boundaries at $0$ and at $nrl+1$ and $ncl+1$. The two exchanges require two Sendrecv invocations.
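The bookkeeping above can be sketched in a few lines of plain Python. This is a hypothetical helper, not part of the example codes; the function name and return convention are our own:

```python
def local_extents(n_global, nprocs):
    """Split n_global points among nprocs ranks (strong scaling).

    Returns the local count and the index range of the computed
    (interior) points; halo points sit at indices 0 and n_local+1.
    """
    if n_global % nprocs != 0:
        raise ValueError("nprocs must divide the global size evenly")
    n_local = n_global // nprocs
    return n_local, (1, n_local)   # computed points run 1..n_local

nrl, (first, last) = local_extents(12, 4)
print(nrl, first, last)   # 3 1 3
print(nrl + 2)            # rows stored per rank, halos included: 5
```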

C++ and Python

For these languages $nrl = nr \div nprocs$ and $ncl=nc$. We will send the first computed row up to row $nrl+1$ of the process at $rank-1$ and the last computed row down to row $0$ of the process at $rank+1$.
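The pattern can be sanity-checked without MPI. The following plain-Python sketch stands three simulated ranks' grids side by side and copies rows exactly as described above; the direct list copies are a stand-in for the Sendrecv calls, not MPI:

```python
nprocs, nrl, ncl = 3, 2, 4

# One (nrl+2) x (ncl+2) grid per simulated rank; interior rows hold the
# rank id, halo rows hold -1 so we can see what the "exchange" fills in.
grids = [[[r if 1 <= i <= nrl else -1 for _ in range(ncl + 2)]
          for i in range(nrl + 2)] for r in range(nprocs)]

# "Send first computed row up": row 1 of rank r lands in row nrl+1 of rank r-1.
for r in range(1, nprocs):
    grids[r - 1][nrl + 1] = list(grids[r][1])

# "Send last computed row down": row nrl of rank r lands in row 0 of rank r+1.
for r in range(nprocs - 1):
    grids[r + 1][0] = list(grids[r][nrl])

print(grids[1][0])        # bottom interior row of rank 0: all 0s
print(grids[1][nrl + 1])  # top interior row of rank 2: all 2s
```

Rank 0 never receives into its top halo and rank nprocs-1 never receives into its bottom halo, which is exactly the situation MPI_PROC_NULL handles in the real codes.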

C++ syntax:

MPI_Sendrecv(&w[1][1],nc, MPI_DOUBLE,up,tag,&w[nrl+1][1],
                      nc, MPI_DOUBLE,down,tag,MPI_COMM_WORLD,&status);

MPI_Sendrecv(&w[nrl][1],nc,MPI_DOUBLE,down,tag,&w[0][1],
                        nc,MPI_DOUBLE,up,tag,MPI_COMM_WORLD,&status);

Note that although the variable w by itself is a pointer, a specific element such as w[1][1] is not (it is the dereferenced value at that memory location), so its address must be taken with & when it is passed to MPI in C++.

Python syntax:

comm.Sendrecv([w[1,1:ncl+1],MPI.DOUBLE], up, tag, [w[nrl+1,1:ncl+1],MPI.DOUBLE], down, tag )
comm.Sendrecv([w[nrl,1:ncl+1],MPI.DOUBLE], down, tag, [w[0,1:ncl+1],MPI.DOUBLE], up, tag )
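A word on why these slices work as MPI buffers: in a C-ordered NumPy array a row slice is a contiguous view, so mpi4py can hand its memory directly to MPI, while a column slice is strided. A small check (assuming only NumPy):

```python
import numpy as np

w = np.zeros((6, 6))      # C-ordered (row-major) by default
row = w[1, 1:5]           # a row slice, like w[1,1:ncl+1] above
col = w[1:5, 1]           # a column slice, for comparison

print(row.flags['C_CONTIGUOUS'])  # True: rows are contiguous views
print(col.flags['C_CONTIGUOUS'])  # False: columns are strided
```

This is also why the Fortran version exchanges columns instead: in column-major storage it is the columns that are contiguous.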

Fortran

For Fortran $nrl=nr$ and $ncl= nc \div nprocs$. We send the first computed column left to column $ncl+1$ of $rank-1$ and the last computed column right to column $0$ of $rank+1$.

call MPI_SENDRECV(w(1:nr,1),nr,MPI_DOUBLE_PRECISION,left,tag, w(1:nr,ncl+1),nr,&
                               MPI_DOUBLE_PRECISION,right,tag,                 &
                                                  MPI_COMM_WORLD,mpi_stat,ierr)

call MPI_SENDRECV(w(1:nr,ncl),nr,MPI_DOUBLE_PRECISION,right,tag, w(1:nr,0),nr, &
                                 MPI_DOUBLE_PRECISION,left,tag,                &
                                                  MPI_COMM_WORLD,mpi_stat,ierr)

MPI_PROC_NULL

Rank $0$ has no neighbor to the top or left, and rank $nprocs-1$ has no neighbor to the bottom or right. We might think that we would have to write a complicated set of conditionals to handle these situations, but this is a common scenario and MPI provides a built-in solution: MPI_PROC_NULL.

The special value MPI_PROC_NULL can be used in place of a source or destination and results in a “no op,” i.e. the send or receive is not attempted and the function returns immediately.
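The neighbor computation used in all three examples below can be sketched as follows. PROC_NULL here is an assumed stand-in constant; real code would use MPI_PROC_NULL (MPI.PROC_NULL in mpi4py):

```python
PROC_NULL = -2  # stand-in sentinel, analogous to MPI_PROC_NULL

def neighbors(rank, nprocs):
    """Return (up, down) neighbor ranks for a 1-D decomposition by rows."""
    up = PROC_NULL if rank == 0 else rank - 1
    down = PROC_NULL if rank == nprocs - 1 else rank + 1
    return up, down

print(neighbors(0, 4))  # first rank: no up neighbor
print(neighbors(3, 4))  # last rank: no down neighbor
```

With these values, every rank calls the same two Sendrecv lines and the no-op sends and receives take care of the edges.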

Examples

These example codes set up and execute one halo exchange. First we find the neighbors for each rank (“up” and “down” or “left” and “right”) before doing any transfers. Boundary conditions are set, though in these examples they do not represent anything physical.

In order to verify our results, we print the beginning values of the local arrays for each rank. In order to keep the values organized, all non-root processes will send values to the root process, which will then print them. This is repeated after the exchange, so we can check that the exchanges are correct. This is not efficient and is not normally done in a production code, but it allows us to understand what happens. Remember that “real” boundary values are sent into halos (also called ghost zones).

Fortran and NumPy arrays are contiguous in memory, but C/C++ currently lacks true multidimensional, dynamically sized arrays. Our example C++ code sets up the array as an array of pointers, each pointing to a one-dimensional array. These row arrays are not in general consecutive in memory. For MPI the “buffer” is passed as a pointer to its first location; if the array is not contiguous, as is usually the case, sending ncount*size_of_type will not pull in all the values. Therefore we pack the two-dimensional array into a one-dimensional array for printing.
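The packing step is the same row-major loop the C++ example uses before sending to the root. A plain-Python sketch with made-up sizes:

```python
nrl, ncl = 2, 3
# A (nrl+2) x (ncl+2) grid with distinguishable entries.
w = [[10 * i + j for j in range(ncl + 2)] for i in range(nrl + 2)]

u = []
for i in range(nrl + 2):        # rows 0 .. nrl+1, halos included
    for j in range(ncl + 2):    # columns 0 .. ncl+1
        u.append(w[i][j])       # row-major ("position++") order

print(len(u))   # (nrl+2)*(ncl+2) = 20
print(u[:5])    # first row of w: [0, 1, 2, 3, 4]
```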

In order to receive the buffer from each of the other processes for printing, root will loop through all other ranks, receiving, while the send is written so that each rank sends once to the root. We use MPI_ANY_SOURCE and MPI_ANY_TAG (MPI.ANY_SOURCE and MPI.ANY_TAG for mpi4py) to match whatever message arrives, since we cannot guarantee ordering. We can then use the MPI_Status variable to extract information about the sender.

C++

#include <iostream>
#include <cstring>
#include <mpi.h>

using namespace std;

int main (int argc, char *argv[]) {

    int i, j;
    int N;

    // Added for MPI
    int nr, nrl, nc, ncl;
    int rank, nprocs;
    MPI_Status status;
    int root=0, tag=0;
    int up, down;

     //Initialize MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);

   //Usually we would read in nr and nc; they represent the global grid RxC;
   //For this case they are both N.
    N=12;
    nr=N;
    nc=N;

    //Use ncl for clarity
    ncl=nc;

    if (nr%nprocs!=0) {
        if (rank==0) {
            cout<<"Not an even division of the arrays.\n";
        }
        MPI_Finalize();
        return 0;
    }
    else {
        nrl=nr/nprocs;
    }

    //Find my neighbors
    if (rank==0) {
        up=MPI_PROC_NULL;
    }
    else {
        up=rank-1;
    }

    if (rank==nprocs-1) {
        down=MPI_PROC_NULL;
    }
    else {
        down=rank+1;
    }

    cout<<"Topology "<<rank<<" "<<up<<" "<<down<<endl;

    int nrows=nrl+2; int ncols=ncl+2; 
    double **w=new double*[nrows];
    double *wptr=new double[(nrows)*(ncols)];

    for (int i=0;i<nrl+2;++i,wptr+=ncl+2) {
       w[i] = wptr;
    }

    //Initialize 

    for (int i=0; i<=nrl+1; ++i) {
        for (int j=0; j<=ncl+1; ++j) {
            w[i][j]=50.;
        }
    }

    for ( int j=0; j<=ncl+1; ++j ) {
        w[1][j]=(double)(rank+1)*2.;
        w[nrl][j]=(double)(rank+1)*2.5;
    }

    double topBC=0.;
    double bottomBC=200.;
    double leftBC=100.;
    double rightBC=100.;

    for (int i=0;i<=nrl+1;++i){
        w[i][0]=leftBC;
        w[i][ncl+1]=rightBC;
    }

    if (rank==0) {
       for (int i=0;i<=ncl+1;++i){
          w[0][i]=topBC;
       }
    }
    if (rank==nprocs-1) {
        for (int i=0;i<=ncl+1;++i){
            w[nrl+1][i]=bottomBC;
       }
    }

    //starting
    //This forces the output to show one rank at a time. It's not efficient.
    int uwsize=(nrl+2)*(ncl+2);
    double *u =  new double[uwsize];
    memset(u,0,uwsize*sizeof(double));
    int position;

    MPI_Barrier(MPI_COMM_WORLD);

    if ( rank == 0 ) {
        cout<<"---W Before for rank 0---"<<endl;
        for (int i=0;i<nrl+2;i++) {
            for (int j=0;j<ncl+2;j++) {
                cout<<w[i][j]<<" ";
            }
            cout<<endl;
        }
        for (int n=1;n<nprocs;++n) {

            memset(u,0,uwsize*sizeof(double));
            MPI_Recv(u,uwsize,MPI_DOUBLE,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
            cout<<"--Before for rank "<<status.MPI_SOURCE<<endl;
            position=0;
            for (int i=0;i<=nrl+1;i++) {
                for (int j=0;j<=ncl+1;j++) {
                    cout<<u[position++]<<" ";
                }
                cout<<endl;
            }
        }
    }
    else {

        // Pack the 2D array into the buffer
        cout<<"Size "<<uwsize<<" "<<nrl<<" "<<ncl<<" "<<nrl+2<<" "<<ncl+2<<endl;
        position=0;
        for (int i=0; i<=nrl+1; i++) {
            for (int j=0;j<=ncl+1; j++) {
                u[position++]=w[i][j];
            }
        }

        MPI_Send(u,uwsize, MPI_DOUBLE, 0, rank, MPI_COMM_WORLD);

    }

    MPI_Barrier(MPI_COMM_WORLD);
    //end of test section


    MPI_Sendrecv(&w[1][0],ncl+2, MPI_DOUBLE,up,tag,&w[nrl+1][0],
                          ncl+2, MPI_DOUBLE,down,tag,MPI_COMM_WORLD,&status);

    MPI_Sendrecv(&w[nrl][0],ncl+2,MPI_DOUBLE,down,tag,&w[0][0],
                             ncl+2,MPI_DOUBLE,up,tag,MPI_COMM_WORLD,&status);

       //Print results
    MPI_Barrier(MPI_COMM_WORLD);

    if ( nprocs > 3 && rank == 3 ) {
        cout<<"---After for rank 3 (direct print)---"<<endl;
        for (int i=0;i<nrl+2;i++) {
            for (int j=0;j<ncl+2;j++) {
                cout<<w[i][j]<<" ";
            }
            cout<<endl;
        }

    }
    MPI_Barrier(MPI_COMM_WORLD);

    if ( rank == 0 ) {
        cout<<"---After for rank 0---"<<endl;
        for (int i=0;i<nrl+2;i++) {
            for (int j=0;j<ncl+2;j++) {
                cout<<w[i][j]<<" ";
            }
            cout<<endl;
        }
        for (int n=1;n<nprocs;++n) {

            memset(u,0,uwsize*sizeof(double));
            MPI_Recv(u,uwsize,MPI_DOUBLE,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
            cout<<"--After for rank "<<status.MPI_SOURCE<<endl;
            position=0;
            for (int i=0;i<=nrl+1;i++) {
                for (int j=0;j<=ncl+1;j++) {
                    cout<<u[position++]<<" ";
                }
                cout<<endl;
            }
        }
    }
    else {

        // Pack the 2D array into the buffer
        position=0;
        for (int i=0; i<=nrl+1; i++)
            for (int j=0;j<=ncl+1; j++)
                u[position++]=w[i][j];

        MPI_Send(u,uwsize, MPI_DOUBLE, 0, rank, MPI_COMM_WORLD);

    }

    delete [] w[0];
    delete [] w;
    delete [] u;

    MPI_Finalize();

}


Fortran

program halo
   use mpi
   implicit none

   double precision, allocatable, dimension(:,:)  :: u,w
   integer            :: N=12
   double precision   :: topBC, bottomBC, leftBC, rightBC
   integer            :: i,j

   integer            :: nr, nc, nrl, ncl
   integer            :: rank, nprocs, tag=0
   integer            :: ierr
   integer, dimension(MPI_STATUS_SIZE) :: mpi_stat
   integer, parameter :: root=0
   integer            :: left, right

   !Initialize MPI, get the local number of columns
   call MPI_INIT(ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)

   !For this case
   !Usually we'd read in nr and nc; they represent the global grid size RxC.
   nr=N
   nc=N

   !Generally we'd just use nr and not duplicate; this is for clarity later
   nrl=nr

   ! weak scaling
   !ncl = nc
   ! strong scaling
   if (mod(nc,nprocs).ne.0) then
       call MPI_FINALIZE(ierr)
       stop 'Not an even division of the array'
   else
       ncl = nc/nprocs
   endif

   ! Find my neighbors
   if ( rank .eq. root ) then
      left = MPI_PROC_NULL
   else
      left = rank - 1
   endif

   if ( rank .eq. nprocs-1 ) then
      right = MPI_PROC_NULL
   else
      right = rank + 1
   endif

   !In this code, u is only used for verification of results
   allocate(w(0:nrl+1,0:ncl+1),u(0:nrl+1,0:ncl+1))

   ! Initialize the array
   w(:,:)=50.
   !Set up columns to exchange
   w(:,1)=dble(rank+1)*2
   w(:,ncl)=dble(rank+1)*2.5

   ! Boundary conditions
   topBC=0.d0
   bottomBC=200.d0
   leftBC=100.d0
   rightBC=100.d0

   !Fortran stitches by column
   w(0,:)=topBC
   w(nrl+1,:)=bottomBC
   if (rank==0) then
      w(1:nrl,0)=leftBC
   else if (rank==nprocs-1) then
      w(1:nrl,ncl+1)=rightBC
   endif

   write(*,*) "Topology ",rank, left, right

   !starting
   !This forces the output to show one rank at a time. It's not efficient.
   if (rank==0) then
       write(*,*) 'Initial for rank 0'
          do j=0,nrl+1
             write(*,'(*(f10.2))') w(j,:)
          enddo
       do i=1,nprocs-1
          u=0.d0
          call MPI_Recv(u,size(u),MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,         &
                        MPI_ANY_TAG, MPI_COMM_WORLD,mpi_stat,ierr)
          write(*,*) 'Initial for rank',mpi_stat(MPI_SOURCE)
          do j=0,nrl+1
             write(*,'(*(f10.2))') u(j,:)
          enddo
       enddo
   else
       call MPI_Send(w,size(w),MPI_DOUBLE_PRECISION,0,rank,MPI_COMM_WORLD,ierr)
   endif

   call MPI_Barrier(MPI_COMM_WORLD,ierr)

   ! Exchange halo values
   call MPI_SENDRECV(w(1:nrl,1)    ,nrl,MPI_DOUBLE_PRECISION,left,tag,         &
                     w(1:nrl,ncl+1),nrl,MPI_DOUBLE_PRECISION,right,tag,        &
                                        MPI_COMM_WORLD,mpi_stat,ierr)


   call MPI_SENDRECV(w(1:nrl,ncl)  ,nrl,MPI_DOUBLE_PRECISION,right,tag,        &
                     w(1:nrl,0)    ,nrl,MPI_DOUBLE_PRECISION,left,tag,         &
                                        MPI_COMM_WORLD,mpi_stat,ierr)

   !result
   if (rank==0) then
       write(*,*) 'Result for rank 0'
          do j=0,nrl+1
             write(*,'(*(f10.2))') w(j,:)
          enddo
       do i=1,nprocs-1
          u=0.d0
          call MPI_Recv(u,size(u),MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,         &
                        MPI_ANY_TAG, MPI_COMM_WORLD,mpi_stat,ierr)
          write(*,*) 'Result for rank',mpi_stat(MPI_SOURCE)
          do j=0,nrl+1
             write(*,'(*(f10.2))') u(j,:)
          enddo
       enddo
   else
       call MPI_Send(w,size(w),MPI_DOUBLE_PRECISION,0,rank,MPI_COMM_WORLD,ierr)
   endif

   call MPI_FINALIZE(ierr)

end program 


Python

import sys
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

N = 12

#Strong scaling
if N%nprocs==0:
    nrl = N//nprocs
else:
    print("Number of ranks should divide the number of rows evenly.")
    sys.exit()

#Weak scaling
#nrl = N

#Both
ncl = N

w = np.zeros((nrl+2, ncl+2), dtype=np.double)

#Arbitrary value for interior
#Be sure not to overwrite boundaries.
w[1:nrl+1,1:ncl+1] = 50.

#Rank-specific values for some rows
w[1,:]=(rank+1)*2.
w[nrl,:]=(rank+1)*2.5

#Set up boundaries
topBC=0.
bottomBC=200.
leftBC=100.
rightBC=100.

if rank == 0 :
    w[0,:] = topBC     # up

if rank == nprocs-1 :
    w[nrl+1,:] = bottomBC  # bottom

w[:,0] = leftBC      # left
w[:,ncl+1] = rightBC   # right

#This forces the output to show one rank at a time. It is not efficient.
status=MPI.Status()

if rank==0:
    u=np.zeros_like(w)
    print("Initial for rank 0")
    for i in range(nrl+2):
        print(w[i,:])
    print()

    for i in range(1,nprocs):
        u[:,:]=0.0
        comm.Recv([u,MPI.DOUBLE],source=MPI.ANY_SOURCE,tag=MPI.ANY_TAG,status=status)
        print("Initial for rank",status.Get_source())
        for j in range(nrl+2):
            print(u[j,:])
    print()
else:
    comm.Send([w,MPI.DOUBLE],dest=0,tag=rank)

comm.Barrier()
##End debugging section

# setting up the up and down rank for each process
if rank == 0 :
    up = MPI.PROC_NULL
else :
    up = rank - 1

if rank == nprocs - 1 :
    down = MPI.PROC_NULL
else :
    down = rank + 1

print("Topology", rank, up, down)

tag=0

# sending up and receiving down
comm.Sendrecv([w[1,1:ncl+1],MPI.DOUBLE], up, tag, [w[nrl+1,1:ncl+1],MPI.DOUBLE], down, tag)
# sending down and receiving up
comm.Sendrecv([w[nrl,1:ncl+1],MPI.DOUBLE], down, tag, [w[0,1:ncl+1],MPI.DOUBLE], up, tag)

# Print result
comm.Barrier()
if rank==0:
    u[:,:]=0.0
    print("Final for rank 0")
    for i in range(nrl+2):
        print(w[i,:])
    print()

    for i in range(1,nprocs):
        comm.Recv([u,MPI.DOUBLE],source=MPI.ANY_SOURCE,tag=MPI.ANY_TAG,status=status)
        print("Final for rank",status.Get_source())
        for j in range(nrl+2):
            print(u[j,:])
        print()
else:
    comm.Send([w,MPI.DOUBLE],dest=0,tag=rank)


Exercises

Review strong and weak scaling. The example programs are set up for strong scaling. Modify them for weak scaling by using constants for the local row and column sizes.

Simplify the validation step by taking two or three appropriate ranks and comparing the before and after values of the exchanged rows or columns.

© 2026 The Rector and Visitors of the University of Virginia