Because computational performance is critical if simulations are to serve as an effective tool for studying and designing dynamic systems, the performance gains offered by Graphics Processing Units (GPUs) cannot be ignored. The GPU is designed to execute a very large number of simultaneous tasks (nominally Single Instruction, Multiple Data (SIMD)), so recursive algorithms such as the Divide-and-Conquer Algorithm (DCA) are, in general, not well suited to GPU-type architectures, because each level of recursion depends on the previous level. Nevertheless, the GPU can still be leveraged to increase computational performance when the DCA is used to form and solve the equations of motion for articulated multibody systems with a very large number of degrees-of-freedom.

Computational performance of dynamic simulations is highly dependent on the nature of the underlying formulation and the number of generalized coordinates used to characterize the system. Therefore, algorithms whose cost scales at a lower order with the number of degrees-of-freedom are generally preferred when dealing with large (N > 10) systems. The utility of simulation as a scientific tool, however, is directly related to actual compute time. The DCA, like other top-performing methods, has demonstrated the desirable property that the required compute time scales linearly (O(n)) with the number of degrees-of-freedom (n) in serial and sublinearly (O(log n)) when implemented in parallel. For the DCA, however, total compute time could be further reduced by exploiting the large number of independent operations involved in the first few levels of recursion.
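
The scaling argument can be illustrated with a minimal sketch, assuming the recursion is organized as a binary assembly tree in which adjacent bodies are combined pairwise at each level. The function name `assembly_levels` is hypothetical and is not taken from the paper; the sketch only counts operations and performs no dynamics.

```python
def assembly_levels(n_bodies):
    """Count the pairwise assemblies at each level of a binary assembly tree.

    Assumption (not from the paper): adjacent bodies are combined in pairs,
    and an unpaired body is carried unchanged to the next level.
    Level 0 is the leaf level; the last entry is the root assembly.
    """
    if n_bodies < 1:
        raise ValueError("n_bodies must be at least 1")
    counts = []
    remaining = n_bodies
    while remaining > 1:
        pairs = remaining // 2
        counts.append(pairs)
        remaining = pairs + remaining % 2  # carry an unpaired body upward
    return counts


if __name__ == "__main__":
    counts = assembly_levels(1024)
    print(len(counts))   # number of levels: 10, i.e. log2(1024)
    print(sum(counts))   # total assemblies: 1023, i.e. n - 1
    print(counts[:3])    # the first levels hold most of the work: [512, 256, 128]
```

Under this assumption the total number of assemblies is n - 1, which gives the linear serial cost, while the number of levels grows as log2(n), which bounds the parallel cost. The assemblies within one level are independent of each other, and the first few levels contain most of them.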

A simple chain-type pendulum example is used to explore the feasibility of using the GPU to execute the assembly and disassembly operations for those levels of recursion that contain enough bodies for this to be computationally advantageous. For the remaining levels, a multi-core CPU performs the operations in parallel using OpenMP. The number of levels of recursion that utilize the GPU is varied from zero to all levels. The case with zero GPU levels provides the reference compute time, in which the assembly and disassembly operations at every level are performed in parallel using OpenMP. The compute time required to simulate the system for one time step is compared with this reference as both the number of GPU levels and the number of bodies in the system are varied.
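
The experimental design described above can be sketched as a simple level-to-device plan. This is an illustrative assumption about how such a sweep might be organized, not the authors' implementation: the names `level_plan` and `sweep_gpu_levels` are hypothetical, the tree is assumed to be binary, and the "gpu" and "cpu" labels stand in for the actual GPU and OpenMP code paths.

```python
def level_plan(n_bodies, n_gpu_levels):
    """Assign each level of a binary assembly tree to a device.

    Assumption (not from the paper): the first n_gpu_levels levels, which
    contain the most independent pairwise operations, are labeled "gpu";
    all remaining levels are labeled "cpu" (e.g. OpenMP threads).
    Returns a list of (level, pairs_at_level, device) tuples.
    """
    plan = []
    remaining = n_bodies
    level = 0
    while remaining > 1:
        pairs = remaining // 2
        device = "gpu" if level < n_gpu_levels else "cpu"
        plan.append((level, pairs, device))
        remaining = pairs + remaining % 2  # carry an unpaired body upward
        level += 1
    return plan


def sweep_gpu_levels(n_bodies):
    """Build one plan per GPU-level count, from zero (reference) to all levels."""
    n_levels = len(level_plan(n_bodies, 0))
    return [level_plan(n_bodies, k) for k in range(n_levels + 1)]


if __name__ == "__main__":
    for plan in sweep_gpu_levels(16):
        print([device for _, _, device in plan])
```

The first plan in the sweep, with every level on the CPU, corresponds to the reference configuration; repeating the sweep for different values of `n_bodies` mirrors the comparison across system sizes.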

Relative to the reference compute time, a decrease in compute time when using the GPU is demonstrated even for systems of moderate size (n < 1000). This is a lower number of bodies than was expected for this test case, and it confirms that the GPU can bring significant increases in computational efficiency for large systems while preserving the attractive sublinear scalability (with respect to compute time) of the DCA.
