MCS 572 Cray C90 & T3D Group Project Suggestions
Fall 1995
Professor F. B. HANSON
DUE Wednesday 08 November 1995 in class.
Two(2) copies of the report are due Wednesday 08 November 1995 in class
(one graded copy will be returned and the other will be used for the
progress report and next class proposal to PSC).
Students will make short presentations of group project results in class,
starting on Monday 06 November 1995.
CAUTION: Projects should have sufficient work to effective utilize
the both the C90 and T3D, by a comparison of the two platforms, but should
not be so time consuming as to severely affect the performance of other users.
Write a group (1 .leq. group .leq. 2) with good load balancing
among the group members) report that is a short paper (8 or 15 or so
pages) as if for publication (e.g. with
- abstract (short description of problem and results)
- introduction (motivate your problem for the class, citing prior
work)
- problem or method
- results and discussion (should include theoretical explanations of
interesting results and graphs)
- conclusions (brief)
- acknowledgements (give thanks to others that helped you)
- references (list papers and books that you used as sources)
- appendices: compiler informational code listing (*.l file with
vector loop marking from "-em" option of cft77), supporting timings and if
relevant assembly listings. Also have similar data for the T3D.
You are welcome to make up
your own projects, but you should discuss this with Professor Hanson
before hand for suggestions. Also let him know what ever project
you select, because even the following ideas are very broad.
WARNING: If you use test or sample floating point arrays in your
project, make sure they are genuine and random floating point, i.e.,
do not use trivial integers or numbers with patterns. Consult the
class local user's guide for how to run a scalar job to use as a
reference measurement.
The Projects
- Own Project.
A Cray project or your own design, such as optimization
of some method connected with your thesis research area, graphical
visualization, another course, or some interesting science-engineering area.
- Statistics Project.
Generate suitable sets of random
numbers (make sure they are floating point),
each with a different sample size N. `ranf' is a very good random
number generator, but check it out yourself.
Describe how you tested the
randomness of your data, e.g., test for a uniform random distribution.
For each set, compute
basic statistics, like mean, variance and Chi-Square test in as
efficient vector manner as possible (i.e., make use of the extended
Fortran90 intrinsic sum function `sum' on the Cray (replaces old SCILIB
`ssum', but needs an `intrinsic sum' declaration)). Plot T versus N and T
versus p. Estimate or compute and plot
the Amdahl vector fraction as a function of N. Compare speedups and
efficiencies relative to N. Is the Amdahl law
operative as the problem size N becomes large?
Develop your own performance model that is appropriate for the behavior
of the timing data with number of processors p, sample size N and
Chi-Square bin size Nb.
Does your performance model account for deviations in Amdahl's law?
- Loop Unrolling and Vector Touches.
Measure actual performance on the Cray and advantage of "unrolling" loops for
several depths of unrolling for matrix-vector and matrix-matrix
multiplication. Compare results to the theoretical results presented in class
from J.J. Dongarra and S.C. Eisenstat (1984), "Squeezing the Most out of
an Algorithm in CRAY Fortran," ACM Trans. Math. Software 10 (No. 3)
pp. 221-230 and J.J. Dongarra and A.R. Hinds (1979),
"Unrolling Loops in FORTRAN," Software-Practice and Experience 9,
pp. 219-229 and W.R. Cowell and C.P. Thompson, "Transforming Fortran
DO Loops to Improve Performance on Vector Architectures, "ACM Trans.
Math. Software 12 (No. 4, Dec. 1986), pp. 324-353. See also J. Levesque
and J.W. Williamson, "A Guidebook to Fortran on Supercomputers," 1989.
Find out how to get a pseudo-assembler listing of the principal
parts of your program for analysis (check the `man cft77' for
the `-s [calfile]' option and explore simple examples).
Is there an optimal unrolling depth due to the fact that
there is a hardware limit of only 8 vector registers on the Cray.
Present results in MegaFlops versus unrolling depth and determine a maximum,
if any. Measure the cost of scalar touches relative to vector touches, and
use the ratio to correct the touch count in your report discussion.
Try both horizontal (LHS) or vertical (RHS) unrolling for matrix
multiplication.
- Vector Stride Effects.
Measure the degradation of performance due to processing non-integral
multiples of the Cray vector stride (64). Use a wide range of vector
sizes and several types of vector and array operations.
Also, make several measurements around each multiple of the vector
register length (64). Can you distinguish section, subsection and bank
conflicts? Can you find programming methods, like block decomposition or loop
restructuring, that can get better performance than that of the Cray
does automatically on an unaltered loop in optimization mode?
Also, try to measure if there is really little overhead in the Y-MP
vectorization by comparing your vectorization results with the scalar
results when vectorization is turned off by the "CDIR$ NOVECTOR"
- Row versus Column Oriented Multiplication Loops.
Determine regions of array size where there are
efficiency advantages on the Cray using
column referencing as opposed to row referencing in reordering
numerical linear algebra multiple loops (matrix-vector and
matrix-matrix multiplication).
Is the simple Fortran column environment
argument valid, and if not why not? How strong is the dependence on
loop iteration size N? What about rectangular (non-square) matrices. Is there
a "shape effect"? What about stride effects?
Make sure your floating point arrays are genuine.
(See Dongarra, Gustavson and Karp, SIAM
Review, Vol. 26, 1984, pp. 91-122; for the CRAY-1).
- Row versus Column Oriented LU Decomposition Loops.
Determine regions of array size where there are
efficiency advantages on the Cray using
column referencing as opposed to row referencing in reordering
LU decomposition multiple loops. Is the simple Fortran column environment
argument valid, and if not why not? How strong is the dependence on
loop iteration size N? What about rectangular (non-square) matrices.
Make sure your floating point arrays are genuine.
(See Dongarra, Gustavson and Karp, SIAM
Review, Vol. 26, 1984, pp. 91-122; for the CRAY-1).
- Validity of Hanson's "Avoid These Things".
Investigate a number of Professor Hanson's Rules of Thumb about "Avoiding
Certain Optimization Hindering Constructs". Find out the validity
on loops (if loops were involved) with sufficient work (ie., bigger than
the toy class examples). Find regions of work size, if any, where each
rule works. For example: What is the quantitative difference in overhead
between common and subroutine argument passing? How much does inlining
subroutines and functions save?
statement.
- Iteration Methods.
Make a comparison of the performance of Jacobi
and Gauss-Seidel methods for Elliptic Partial Differential equations.
Gauss-Seidel is better for serial computers, but what about parallel
and vector computers?
- Compare Strassen's matrix multiplication algorithm with the usual
form.
See D. Bailey, "Extra High Speed Multiplication on the Cray-2", SIAM J. Sci.
Comp. 9(3), May 1988, p. 603-607.
- Test whether higher levels of vectorization give higher performance.
For instance, does the command `cf77 -Zv -Wf"-em" [fn].f' lead to faster
executables than `cf77 -Wf"-em" [fn].f' for matrix multiplication or
some other application. Are there similar comparisons for the T3D?