Method for Software Pipelining on Graphical Processing Units

Authors

DOI:

https://doi.org/10.20535/2786-8729.6.2025.331193

Keywords:

software pipelining, Graphics Processing Unit, C-slowing, retiming, synchronous dataflow

Abstract

Graphics Processing Units (GPUs) play a significant role in high-end computations, including artificial intelligence. However, GPU hardware is often underloaded, so more GPU hardware must be deployed to maintain high task throughput. The low loading of GPU resources remains a pressing problem. Therefore, it is essential to seek methods that increase GPU loading.

The object of this research is computational processes in modern processors, especially in GPUs. The purpose of this study is to review the software pipelining approach, its advantages and disadvantages, and the techniques it can employ, in both instruction-level and decoupled versions, and to assess the effectiveness of this approach for the GPU.

To meet these aims, several analysis methods were used. First, the architectural requirements for applying software pipelining were reviewed. Second, the original formulation and historical development of the approach were examined. Third, different levels of parallelisation at which software pipelining can be implemented were explored. Finally, C-slowing was proposed as an optimisation technique to counter the underutilisation of computational resources.

The research has revealed an abundance of software pipelining implementations for GPUs. Although existing works explore the possibilities of this technique, it is often overlooked in favour of simpler multi-threading techniques. The reviewed studies identify memory overloading, specifically of the pipeline registers, as the crucial factor limiting the use of computational resources. To address this, the C-slowing approach was suggested and theoretically evaluated. It indicated a possible increase of over 30% in GPU loading for the analysed algorithm, which supports its applicability.

In conclusion, the software pipelining approach shows promising potential for optimising GPU algorithms and merits further investigation. C-slowing could be used to address the underutilisation of computational resources.
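
As an illustration of the C-slowing idea named above, the following minimal Python sketch interleaves C independent data streams through a single loop body. It is not the authors' implementation: the first-order recurrence y[n] = a * y[n-1] + x[n] and the function names `recurrence` and `c_slowed_recurrence` are assumptions chosen only for illustration. The single state value is replaced by a delay line of C values, so every iteration depends on a result produced C iterations earlier. That slack is what a subsequent retiming step could distribute across pipeline stages.

```python
from collections import deque


def recurrence(xs, a):
    """Reference loop for one stream: y[n] = a * y[n-1] + x[n]."""
    y, out = 0, []
    for x in xs:
        y = a * y + x
        out.append(y)
    return out


def c_slowed_recurrence(streams, a):
    """C-slowed loop: C equal-length, independent streams share one loop body.

    The single state variable becomes a delay line of C values, so each
    iteration uses the result computed C iterations earlier.
    """
    c = len(streams)
    delay = deque([0] * c)               # C state registers instead of one
    outs = [[] for _ in range(c)]
    # Inputs are consumed interleaved: s0[0], s1[0], ..., s0[1], s1[1], ...
    for n in range(len(streams[0])):
        for k in range(c):
            y = a * delay.popleft() + streams[k][n]   # state from C steps ago
            delay.append(y)
            outs[k].append(y)
    return outs


if __name__ == "__main__":
    streams = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # The interleaved loop reproduces the per-stream results.
    print(c_slowed_recurrence(streams, 2) == [recurrence(s, 2) for s in streams])
```

The sketch only shows that the transformation preserves per-stream results. Any gain in hardware loading depends on how the additional delays are retimed on the target architecture.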

Author Biographies

Artemii Vinokurov, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

PhD student at the Computer Engineering Department of the Faculty of Informatics and Computer Technique

Anatoliy Sergiyenko, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

Professor at the Computer Engineering Department of the Faculty of Informatics and Computer Technique, Doctor of Technical Sciences, Professor

Published

2025-09-19

How to Cite

[1]
A. Vinokurov and A. Sergiyenko, “Method for Software Pipelining on Graphical Processing Units”, Inf. Comput. and Intell. syst. j., no. 6, pp. 27–41, Sep. 2025.