Performance Evaluation of a Randomized Input Vector for Square Root Function between Single-threaded and Multi-threaded OpenMP Implementations

Michael Anthony Jay B. Regis


Parallel processors have been the recent trend in the mainstream computing market. But taking advantage of multiple processors is still a challenge since most algorithms are written in a sequential-based manner. An example of a sequential algorithm that is commonly used and implemented is the "square root extraction." Recently, the popularity of inverse square root in the image processing community poses even more computational challenge due to large amount of data inputs. In this paper, the problem of square root extraction on a large data input was addressed using a parallel-based algorithm in OpenMP. The algorithm used varying amounts of thread count and scheduling methods to handle the square root operations then it was executed in a parallel for loop statement. Results showed that the optimum parallel implementation was more than four times its serial counterpart given that the input vector of 20,000 elements is executed in a four physical core (8 logical core) shared memory platform processor. It is also important to note that using too few (i.e. threads lesser than the physical core count) or too many (i.e. threads equal to the input vector) would drastically reduce performance. In this study, a 1:16 logical core count to thread ratio is the optimum. The efficiency of multi-threaded operation using static and guided scheduling is 55.57% and 54.60%, respectively.


Parallel processors; square root; OpenMP; shared memory platform


Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., & Menon, R. (2001). Parallel Programming in OpenMP. (D. E. Penrose, Ed.) San Francisco, California: Morgan Kaufmann Publishers.

Chapman, T., & Kalyanaraman, A. (2011). An OpenMP Algorithm and Implementation for Clustering Biological Graphs. Proceedings of the first workshop on Irregular applications: architectures and algorithm (pp. 3-10). New York: ACM.

El-Rewini, H., & Abd-El-Barr, M. (2005). Advanced Computer Architecture and Parallel Processing. (A. Y. Zomaya, Ed.) Hoboken, New Jersey: John Wiley & Sons Inc.

Ercegovac, M. D., Lang, T., Muller, J.-M., & Tisserand, A. (2000, July). Reciprocation, Square Root, Inverse Square Root, and Some Elementary Functions Using Small Multipliers. IEEE Transactions on Computers, 49, 1-10.

Grama, A., Gupta, A., Karypis, G., & Kumar, V. (2003). Introduction to Parallel Computing. Harlow, Essex: Addison Wesley.

Ito, M., Takagi, N., & Yajima, S. (1995). Efficient Initial Approximation and Fast Converging Methods for Division and Square Root. In S. Knowles, & W. McAllister (Ed.), Proceedings 12th IEEE Symposium on Computer Arithmetic, (pp. 2-9).

Pase, D. M., & Eckl, M. A. (2006). A Comparison of Single-Core and Dual-Core Opteron. (Linux Cluster Institute ) LCI Conference. Linux Cluster Institute. Retrieved from

Sanders, J., & Kandrot, E. (2011). CUDA by Example (An Introduction To General Purpose GPU Programming). Boston, Massachusetts: Pearson Education Inc.

Shen, J. P., & Lipasti, M. H. (2013). Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press, Inc.

Weng, T.-H., Huang, S.-W., Perng, R.-K., Hsu, C.-H., & Li, K.-C. (2009). A Practical OpenMP Implementation of Bit-Reversal for Fast Fourier Transform. (P. Mueller, J.-N. Cao, & C.-L. Wang, Eds.) Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST), 18, pp. 206 – 216.

Full Text: JST_2015 05


  • There are currently no refbacks.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.