MCS 572 Cray C90 & T3D Group Project Suggestions
Fall 1995

Professor F. B. HANSON

DUE Wednesday 08 November 1995 in class.

Two(2) copies of the report are due Wednesday 08 November 1995 in class (one graded copy will be returned and the other will be used for the progress report and next class proposal to PSC). Students will make short presentations of group project results in class, starting on Monday 06 November 1995.

CAUTION: Projects should have sufficient work to effective utilize the both the C90 and T3D, by a comparison of the two platforms, but should not be so time consuming as to severely affect the performance of other users. Write a group (1 .leq. group .leq. 2) with good load balancing among the group members) report that is a short paper (8 or 15 or so pages) as if for publication (e.g. with

abstract (short description of problem and results)
introduction (motivate your problem for the class, citing prior work)
problem or method
results and discussion (should include theoretical explanations of interesting results and graphs)
conclusions (brief)
acknowledgements (give thanks to others that helped you)
references (list papers and books that you used as sources)
appendices: compiler informational code listing (*.l file with vector loop marking from "-em" option of cft77), supporting timings and if relevant assembly listings. Also have similar data for the T3D.

You are welcome to make up your own projects, but you should discuss this with Professor Hanson before hand for suggestions. Also let him know what ever project you select, because even the following ideas are very broad.

WARNING: If you use test or sample floating point arrays in your project, make sure they are genuine and random floating point, i.e., do not use trivial integers or numbers with patterns. Consult the class local user's guide for how to run a scalar job to use as a reference measurement.

The Projects

Own Project. A Cray project or your own design, such as optimization of some method connected with your thesis research area, graphical visualization, another course, or some interesting science-engineering area.
Statistics Project. Generate suitable sets of random numbers (make sure they are floating point), each with a different sample size N. `ranf' is a very good random number generator, but check it out yourself. Describe how you tested the randomness of your data, e.g., test for a uniform random distribution. For each set, compute basic statistics, like mean, variance and Chi-Square test in as efficient vector manner as possible (i.e., make use of the extended Fortran90 intrinsic sum function `sum' on the Cray (replaces old SCILIB `ssum', but needs an `intrinsic sum' declaration)). Plot T versus N and T versus p. Estimate or compute and plot the Amdahl vector fraction as a function of N. Compare speedups and efficiencies relative to N. Is the Amdahl law operative as the problem size N becomes large? Develop your own performance model that is appropriate for the behavior of the timing data with number of processors p, sample size N and Chi-Square bin size Nb. Does your performance model account for deviations in Amdahl's law?
Loop Unrolling and Vector Touches. Measure actual performance on the Cray and advantage of "unrolling" loops for several depths of unrolling for matrix-vector and matrix-matrix multiplication. Compare results to the theoretical results presented in class from J.J. Dongarra and S.C. Eisenstat (1984), "Squeezing the Most out of an Algorithm in CRAY Fortran," ACM Trans. Math. Software 10 (No. 3) pp. 221-230 and J.J. Dongarra and A.R. Hinds (1979), "Unrolling Loops in FORTRAN," Software-Practice and Experience 9, pp. 219-229 and W.R. Cowell and C.P. Thompson, "Transforming Fortran DO Loops to Improve Performance on Vector Architectures, "ACM Trans. Math. Software 12 (No. 4, Dec. 1986), pp. 324-353. See also J. Levesque and J.W. Williamson, "A Guidebook to Fortran on Supercomputers," 1989. Find out how to get a pseudo-assembler listing of the principal parts of your program for analysis (check the `man cft77' for the `-s [calfile]' option and explore simple examples). Is there an optimal unrolling depth due to the fact that there is a hardware limit of only 8 vector registers on the Cray. Present results in MegaFlops versus unrolling depth and determine a maximum, if any. Measure the cost of scalar touches relative to vector touches, and use the ratio to correct the touch count in your report discussion. Try both horizontal (LHS) or vertical (RHS) unrolling for matrix multiplication.
Vector Stride Effects. Measure the degradation of performance due to processing non-integral multiples of the Cray vector stride (64). Use a wide range of vector sizes and several types of vector and array operations. Also, make several measurements around each multiple of the vector register length (64). Can you distinguish section, subsection and bank conflicts? Can you find programming methods, like block decomposition or loop restructuring, that can get better performance than that of the Cray does automatically on an unaltered loop in optimization mode? Also, try to measure if there is really little overhead in the Y-MP vectorization by comparing your vectorization results with the scalar results when vectorization is turned off by the "CDIR$ NOVECTOR"
Row versus Column Oriented Multiplication Loops. Determine regions of array size where there are efficiency advantages on the Cray using column referencing as opposed to row referencing in reordering numerical linear algebra multiple loops (matrix-vector and matrix-matrix multiplication). Is the simple Fortran column environment argument valid, and if not why not? How strong is the dependence on loop iteration size N? What about rectangular (non-square) matrices. Is there a "shape effect"? What about stride effects? Make sure your floating point arrays are genuine. (See Dongarra, Gustavson and Karp, SIAM Review, Vol. 26, 1984, pp. 91-122; for the CRAY-1).
Row versus Column Oriented LU Decomposition Loops. Determine regions of array size where there are efficiency advantages on the Cray using column referencing as opposed to row referencing in reordering LU decomposition multiple loops. Is the simple Fortran column environment argument valid, and if not why not? How strong is the dependence on loop iteration size N? What about rectangular (non-square) matrices. Make sure your floating point arrays are genuine. (See Dongarra, Gustavson and Karp, SIAM Review, Vol. 26, 1984, pp. 91-122; for the CRAY-1).
Validity of Hanson's "Avoid These Things". Investigate a number of Professor Hanson's Rules of Thumb about "Avoiding Certain Optimization Hindering Constructs". Find out the validity on loops (if loops were involved) with sufficient work (ie., bigger than the toy class examples). Find regions of work size, if any, where each rule works. For example: What is the quantitative difference in overhead between common and subroutine argument passing? How much does inlining subroutines and functions save? statement.
Iteration Methods. Make a comparison of the performance of Jacobi and Gauss-Seidel methods for Elliptic Partial Differential equations. Gauss-Seidel is better for serial computers, but what about parallel and vector computers?
Compare Strassen's matrix multiplication algorithm with the usual form. See D. Bailey, "Extra High Speed Multiplication on the Cray-2", SIAM J. Sci. Comp. 9(3), May 1988, p. 603-607.
Test whether higher levels of vectorization give higher performance. For instance, does the command `cf77 -Zv -Wf"-em" [fn].f' lead to faster executables than `cf77 -Wf"-em" [fn].f' for matrix multiplication or some other application. Are there similar comparisons for the T3D?

MCS 572 Cray C90 & T3D Group Project Suggestions Fall 1995

Professor F. B. HANSON

The Projects

MCS 572 Cray C90 & T3D Group Project Suggestions
Fall 1995