MCS 572 Cray C90 & T3D Group Project Suggestions
Fall 1996
Professor F. B. HANSON
DUE Friday 06 December 1996 in class.
Two(2) copies of the report are due Friday 06 December 1996 in class
(one graded copy will be returned and the other will be used for the
progress report and next class proposal to PSC).
Students will make short presentations of group project results in class,
starting on Wednesday 04 December 1996.
CAUTION: Projects should have sufficient work to effective utilize
the both the C90 and T3D, by a comparison of the two platforms, but should
not be so time consuming as to severely affect the performance of other users.
Write a group (1 < group < 2) with good load balancing
among the group members) report that is a short paper (8 or 15 or so
pages) as if for publication, i.e., with
abstract (short description of problem and results)
executive summary (give an itemized brief summary of your paper)
introduction (motivate your problem for the class, citing prior
work)
problem or method
results and discussion (should include theoretical explanations of
interesting results and graphs; explain results whether good or bad)
conclusions (brief)
acknowledgements (give thanks to others that helped you and to the
Pittsburgh Supercomputer Center of use of the Cray C90 and T3D)
references (list articles, books and other documents that you
used as sources)
appendices: compiler informational code listing (*.l file with
vector loop marking from "-em" option of cft77 or corresponding information
if you use the cc compiler), supporting timings and if
relevant assembly listings. Also have similar data for the T3D.
You are welcome to make up
your own projects (see the first suggestion), but you should discuss this
with Professor Hanson
before hand for suggestions. Also let him know what ever project
you select, because even the following ideas are very broad.
WARNING: If you use test or sample floating point arrays in your
project, make sure they are genuine and random floating point, i.e.,
do not use trivial integers or numbers with patterns. Consult the
class local user's guide for how to run a scalar job to use as a
reference measurement. Also, if your project is similar to the one you did
on the Convex SPP1200, then you must give an extensive comparison to your
SPP1200 project, so that the work is comparable to what you would do for
new project topic.
The Projects
Own Project.
A Cray project or your own design, such as optimization
of some method connected with your thesis research area, graphical
visualization, another course, or some interesting science-engineering area.
Statistics Project.
Generate suitable sets of random
numbers (make sure they are floating point),
each with a different sample size N. `ranf' is a very good random
number generator, but check it out yourself.
Describe how you tested the
randomness of your data, e.g., test for a uniform random distribution.
For each set, compute
basic statistics, like mean, variance and Chi-Square test in as
efficient vector manner as possible (i.e., make use of the extended
Fortran90 intrinsic sum function `sum' on the Cray.
Plot T versus N and T
versus p. Estimate or compute and plot
the Amdahl vector fraction as a function of N. Compare speedups and
efficiencies relative to N. Is the Amdahl law
operative as the problem size N becomes large?
Develop your own performance model that is appropriate for the behavior
of the timing data with number of processors p, sample size N and
Chi-Square bin size Nb.
Does your performance model account for deviations in Amdahl's law?
Loop Unrolling and Vector Touches.
Measure actual performance on the Cray and advantage of "unrolling" loops for
several depths of unrolling for matrix-vector and matrix-matrix
multiplication. Compare results to the theoretical results presented in class
from J.J. Dongarra and S.C. Eisenstat (1984), "Squeezing the Most out of
an Algorithm in CRAY Fortran," ACM Trans. Math. Software 10 (No. 3)
pp. 221-230 and J.J. Dongarra and A.R. Hinds (1979),
"Unrolling Loops in FORTRAN," Software-Practice and Experience 9,
pp. 219-229 and W.R. Cowell and C.P. Thompson, "Transforming Fortran
DO Loops to Improve Performance on Vector Architectures, "ACM Trans.
Math. Software 12 (No. 4, Dec. 1986), pp. 324-353. See also J. Levesque
and J.W. Williamson, "A Guidebook to Fortran on Supercomputers," 1989.
Find out how to get a pseudo-assembler listing of the principal
parts of your program for analysis (check the `man cft77' for
the `-s [calfile]' option and explore simple examples).
Is there an optimal unrolling depth due to the fact that
there is a hardware limit of only 8 vector registers on the Cray.
Present results in MegaFlops versus unrolling depth and determine a maximum,
if any. Measure the cost of scalar touches relative to vector touches, and
use the ratio to correct the touch count in your report discussion.
Try both horizontal (LHS) or vertical (RHS) unrolling for matrix
multiplication.
Vector Stride Effects.
Measure the degradation of performance due to processing non-integral
multiples of the Cray vector stride (64). Use a wide range of vector
sizes and several types of vector and array operations.
Also, make several measurements around each multiple of the vector
register length (64). Can you distinguish section, subsection and bank
conflicts? Can you find programming methods, like block decomposition or loop
restructuring, that can get better performance than that of the Cray
does automatically on an unaltered loop in optimization mode?
Also, try to measure if there is really little overhead in the Y-MP
vectorization by comparing your vectorization results with the scalar
results when vectorization is turned off by the "CDIR$ NOVECTOR"
Row versus Column Oriented Multiplication Loops.
Determine regions of array size where there are
efficiency advantages on the Cray using
column referencing as opposed to row referencing in reordering
numerical linear algebra multiple loops (matrix-vector and
matrix-matrix multiplication).
Is the simple Fortran column environment
argument valid, and if not why not? How strong is the dependence on
loop iteration size N? What about rectangular (non-square) matrices. Is there
a "shape effect"? What about stride effects?
Make sure your floating point arrays are genuine.
(See Dongarra, Gustavson and Karp, SIAM
Review, Vol. 26, 1984, pp. 91-122; for the CRAY-1).
Row versus Column Oriented LU Decomposition Loops.
Determine regions of array size where there are
efficiency advantages on the Cray using
column referencing as opposed to row referencing in reordering
LU decomposition multiple loops. Is the simple Fortran column environment
argument valid, and if not why not? How strong is the dependence on
loop iteration size N? What about rectangular (non-square) matrices.
Make sure your floating point arrays are genuine.
(See Dongarra, Gustavson and Karp, SIAM
Review, Vol. 26, 1984, pp. 91-122; for the CRAY-1).
Validity of Hanson's "Avoid These Things".
Investigate a number of Professor Hanson's Rules of Thumb about "Avoiding
Certain Optimization Hindering Constructs". Find out the validity
on loops (if loops were involved) with sufficient work (ie., bigger than
the toy class examples). Find regions of work size, if any, where each
rule works. For example: What is the quantitative difference in overhead
between common and subroutine argument passing? How much does inlining
subroutines and functions save?
statement.
Iteration Methods.
Make a comparison of the performance of Jacobi
and Gauss-Seidel methods for Elliptic Partial Differential equations.
Gauss-Seidel is better for serial computers, but what about parallel
and vector computers? (See Ortega, Intro. Parallel and Vector
Solution of Linear Systems, 1988 and related papers.)
Test whether higher levels of vectorization give higher performance.
For instance, does the command `cf77 -Zv -Wf"-em" [fn].f' lead to faster
executables than `cf77 -Wf"-em" [fn].f' for matrix multiplication or
some other application. Are there similar comparisons for the T3D?
Web Source: http://www.math.uic.edu/~hanson/crayproject.html