LOOP PARALLELIZATION AUTOMATION FOR GRAPHICS PROCESSING UNITS
DOI:
https://doi.org/10.20535/2708-4930.1.2020.216044
Keywords:
CUDA, general-purpose computing on graphics processing units, loop optimization, parallelization methods.
Abstract
A technology is proposed that extends GPU capabilities to handle data volumes exceeding the GPU's internal memory capacity. It combines loop tiling with data serialization and can be applied to clusters consisting of several GPUs. An applicability criterion is specified, and a semi-automatic proof-of-concept software tool is implemented. An experiment demonstrating the feasibility of the proposed technology is described.
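The idea of tiling a loop so that each piece fits into a limited device memory can be illustrated with a minimal host-side sketch. It is not the tool described in the abstract: the function names `process_tile` and `tiled_saxpy`, the `capacity` parameter, and the use of ordinary vectors as a simulated device buffer are all assumptions made for illustration, and the loop is assumed to have independent iterations. In a CUDA program the copies would presumably be host-to-device and device-to-host transfers, and `process_tile` would be a kernel launch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a GPU kernel: processes one tile that is
// already resident in the (simulated) device buffers.
void process_tile(float a, const std::vector<float>& x_dev,
                  std::vector<float>& y_dev, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        y_dev[i] = a * x_dev[i] + y_dev[i];
    }
}

// Tiled SAXPY (y = a * x + y): at most `capacity` elements of each array
// are held in the simulated device buffers at any time.
void tiled_saxpy(float a, const std::vector<float>& x, std::vector<float>& y,
                 std::size_t capacity) {
    assert(capacity > 0 && x.size() == y.size());
    // Fixed-size buffers play the role of the limited device memory.
    std::vector<float> x_dev(capacity), y_dev(capacity);
    for (std::size_t begin = 0; begin < x.size(); begin += capacity) {
        // The last tile may be shorter than `capacity`.
        const std::size_t count = std::min(capacity, x.size() - begin);
        // Copy the current tile "to the device".
        std::copy(x.begin() + begin, x.begin() + begin + count, x_dev.begin());
        std::copy(y.begin() + begin, y.begin() + begin + count, y_dev.begin());
        process_tile(a, x_dev, y_dev, count);
        // Copy the results of the tile "back to the host".
        std::copy(y_dev.begin(), y_dev.begin() + count, y.begin() + begin);
    }
}
```

Calling `tiled_saxpy(2.0f, x, y, 2)` on two five-element vectors processes the data in three tiles of sizes 2, 2, and 1, and gives the same result as the untiled loop because no iteration depends on another.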