In many cases, it is impossible to divide a parallel data structure so that
each process has exactly the same amount of data. An even split may not even
be desirable if the amount of work to be done varies from one part of the
data to another. Modify your code so that
each process can have a different number of rows of the distributed mesh.
You may want to use these MPI routines in your solution:
MPI_Gather
MPI_Gatherv
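
One possible starting point, offered as a sketch rather than the intended solution: before calling MPI_Gatherv, the root needs a per-process count and a displacement into the receive buffer. The helper below, whose name block_rows and parameters total_rows, row_len, and nprocs are hypothetical and not part of the original exercise, assumes a simple block distribution in which the first total_rows % nprocs ranks hold one extra row. It uses no MPI calls itself, so it can be checked on its own.

```c
#include <assert.h>

/* Hypothetical helper (not from the original exercise).
 *
 * Splits total_rows mesh rows over nprocs processes as evenly as possible;
 * the first (total_rows % nprocs) ranks get one extra row. Each row is
 * assumed to hold row_len elements stored contiguously.
 *
 * On return, counts[r] is the number of elements owned by rank r and
 * displs[r] is the element offset of rank r's block in the gathered array.
 * Both arrays must have room for nprocs entries. */
void block_rows(int total_rows, int row_len, int nprocs,
                int counts[], int displs[])
{
    assert(total_rows >= 0 && row_len >= 0 && nprocs > 0);

    int base   = total_rows / nprocs;  /* rows every rank receives */
    int extra  = total_rows % nprocs;  /* leftover rows to hand out */
    int offset = 0;                    /* running element offset */

    for (int rank = 0; rank < nprocs; rank++) {
        int rows = base + (rank < extra ? 1 : 0);
        counts[rank] = rows * row_len;
        displs[rank] = offset;
        offset += counts[rank];
    }
}
```

With arrays filled this way, and assuming the mesh elements are doubles, the gather on root 0 could take the form MPI_Gatherv(local, counts[rank], MPI_DOUBLE, full, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD), where local and full are placeholder buffer names. A distribution weighted by the amount of work per row would only change how the row counts are chosen; the displacements are still the running sum of the counts.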