Molecular docking simulations are widely used in computational drug discovery to predict molecular interactions at close distances. Specifically, these simulations aim to predict the binding poses between a small molecule and a macromolecular target, referred to as the ligand and the receptor, respectively. The purpose of drug discovery is to identify ligands that effectively inhibit the harmful function of a certain receptor. In that context, molecular docking simulations are critical: they can significantly shorten the time-consuming preliminary task of identifying potential drug candidates. Subsequent wet lab experiments can then be carried out on only a narrowed list of promising ligands, reducing the overall cost of experiments.
AutoDock is one of the most widely used software applications for molecular docking simulations. Its main engine is a Lamarckian Genetic Algorithm (LGA), which combines a genetic algorithm with a local-search method to explore several molecular poses. The best pose is predicted from the score, a function that evaluates the free energy (kcal/mol) of a ligand-receptor system. AutoDock is characterized by nested loops with variable upper bounds and divergent control structures. Moreover, the time-intensive score evaluations are typically invoked a couple of million times within each LGA run. Owing to this computational intensity, AutoDock suffers from long execution times, which are mainly attributed to its inability to exploit its embarrassingly parallel structure. In recent years, an OpenCL-based implementation of AutoDock has been developed to accelerate its execution on a variety of devices, including multi-core CPUs, GPUs, and even FPGAs.
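To make the structure of an LGA concrete, the following is a minimal Python sketch, not AutoDock's implementation. The score function `toy_score`, the pose encoding as a flat list of floats, the random-perturbation hill climb used as the local search, and all parameter values are assumptions chosen for illustration. The "Lamarckian" aspect is that the pose refined by the local search is written back into the population, so the improvement is inherited by later generations.

```python
import random


def toy_score(pose):
    """Hypothetical stand-in for the docking score: lower is better."""
    return sum(x * x for x in pose)


def local_search(pose, score, rng, steps=20, step_size=0.1):
    """Random-perturbation hill climb: keep a move only if it lowers the score."""
    best, best_score = list(pose), score(pose)
    for _ in range(steps):
        candidate = [x + rng.uniform(-step_size, step_size) for x in best]
        cand_score = score(candidate)
        if cand_score < best_score:
            best, best_score = candidate, cand_score
    return best


def lga(score, dims=6, pop_size=20, generations=30, seed=0):
    """Toy Lamarckian genetic algorithm over real-valued pose vectors."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dims)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score)
        parents = population[: pop_size // 2]   # selection: keep the better half
        children = [list(parents[0])]           # elitism: best pose survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, dims)
            child = a[:cut] + b[cut:]           # one-point crossover
            if rng.random() < 0.2:              # occasional mutation of one gene
                child[rng.randrange(dims)] += rng.gauss(0, 0.5)
            # Lamarckian step: the locally refined pose replaces the child
            children.append(local_search(child, score, rng))
        population = children
    return min(population, key=score)


if __name__ == "__main__":
    best = lga(toy_score)
    print(toy_score(best))
```

Even in this toy version, the score function is called once per local-search step for every child in every generation, which suggests why score evaluation dominates the runtime of a real LGA run.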
In this work, we present our experiences porting and optimizing the OpenCL-based AutoDock for the SX-Aurora Vector Engine. The OpenCL code is composed of host and device parts, and this split is maintained in the NEC VEO version. As the API functions of OpenCL and VEOffload resemble each other, porting the host code was very smooth. While the device part was also easily ported, extra effort was required to increase performance on the SX-Aurora. For this, we used hardware-specific techniques: choosing appropriate data types for wider vectors, leveraging the multiple cores of the SX-Aurora, pushing outer loops into inner loops in the score calculations and local search, and using multi-process VEO to overcome OpenMP limitations in NUMA mode. Our evaluations were done on the VE10B and VE20B models and compared against modern multi-core CPUs and GPUs.
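One plausible reading of "pushing outer loops into inner loops" is a loop interchange that moves a long outer loop (for example, over candidate poses) to the innermost position, so that the innermost loop has enough iterations to fill wide vector registers. The sketch below shows only the restructuring. The function names, the data layout, and the weighted-square "energy" term are hypothetical, and plain Python gains no vector speedup from the change.

```python
def score_outer_poses(poses, atom_weights):
    """Original ordering: outer loop over poses, short inner loop over atoms."""
    energies = [0.0] * len(poses)
    for p, pose in enumerate(poses):
        for a, w in enumerate(atom_weights):
            energies[p] += w * pose[a] * pose[a]
    return energies


def score_inner_poses(poses, atom_weights):
    """Interchanged ordering: the long loop over poses is now innermost."""
    energies = [0.0] * len(poses)
    for a, w in enumerate(atom_weights):
        # On a vector machine, this inner loop over poses is the one a
        # compiler could map onto long vector instructions.
        for p in range(len(poses)):
            energies[p] += w * poses[p][a] * poses[p][a]
    return energies


if __name__ == "__main__":
    poses = [[0.1 * (p + a) for a in range(4)] for p in range(8)]
    weights = [1.0, 0.5, 2.0, 0.25]
    print(score_outer_poses(poses, weights) == score_inner_poses(poses, weights))
```

Both versions add the per-atom terms to each pose's energy in the same order, so the results are identical. Only the iteration order across poses changes.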