NAG CL Interface
Multithreading

1 Thread Safety

In multithreaded applications, each thread in a team processes instructions independently while sharing the same memory address space. For these applications to operate correctly any functions called from them must be thread safe. That is, any global variables they contain are guaranteed not to be accessed simultaneously by different threads, as this can compromise results. This can be ensured through appropriate synchronization, such as that found in OpenMP.
When a function is described as thread safe we are considering its behaviour when it is called by multiple threads. It is worth noting that a thread unsafe function can still, itself, be multithreaded. A team of threads can be created inside the function to share the workload as described in Section 2.
The NAG CL Interface is thread safe by design: the functions do not use global variables and all communication between them is via argument lists, so they can be safely called simultaneously by multiple threads in your program.
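The difference between the two designs can be illustrated with a minimal sketch. It is not taken from the Library: the function names add_unsafe and sum_safe and the global g_total are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Thread unsafe: the running total lives in a global variable, so two
   threads calling this at the same time may interleave their updates. */
static double g_total = 0.0;

void add_unsafe(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        g_total += x[i];
}

/* Thread safe: all state is passed through the argument list, so each
   caller works only on data it owns. */
double sum_safe(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += x[i];
    return total;
}
```

Several threads may call sum_safe at once on their own arrays; calling add_unsafe concurrently would require synchronisation around the update of g_total.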

1.1 Functions with Function Arguments

Some Library functions require you to supply a function and to pass the name of the function as an actual argument in the call to the Library function. For many of these Library functions, the supplied function interface includes an array parameter (called comm) specifically for you to pass information to the supplied function without the need for global variables.
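The pattern can be sketched as follows. To keep the example self-contained it does not use the Library headers: the type demo_comm, the routine demo_integrate and the callback scaled_linear are hypothetical stand-ins that only mimic the idea of a communication argument, and the real structure and function signatures should be taken from the relevant function document.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Library communication structure. */
typedef struct {
    double *user;   /* caller-owned data forwarded to the callback */
} demo_comm;

/* Signature of the hypothetical user-supplied function. */
typedef double (*demo_fun)(double x, demo_comm *comm);

/* Hypothetical "library" routine: midpoint rule on [a, b] with n panels.
   It forwards comm to every call of the user-supplied function. */
double demo_integrate(demo_fun f, double a, double b, int n, demo_comm *comm)
{
    assert(f != NULL && n > 0);
    double h = (b - a) / n;
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += f(a + (i + 0.5) * h, comm);
    return h * sum;
}

/* User-supplied function: reads its coefficient from comm, not a global. */
double scaled_linear(double x, demo_comm *comm)
{
    return comm->user[0] * x;
}
```

A caller would set up a local array holding the coefficient, point comm.user at it, and pass the address of comm to demo_integrate. Because the data travels through the argument list, two threads can integrate with different coefficients at the same time.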
If you need to provide your supplied function with more information than can be given via the interface argument list, then you are advised to check, in the relevant Chapter Introduction, whether the Library function you intend to call has an equivalent reverse communication interface. These have been designed specifically for problems where user-supplied function interfaces are not flexible enough for a given problem, and their use should eliminate the need to provide data through global variables. Where reverse communication interfaces are not available, it is usual to use global variables containing the required data that is accessible from both the supplied function and from the calling program. It is thread safe to do this only if any global data referenced is made threadprivate by OpenMP or is updated using appropriate synchronisation, thus avoiding the possibility of simultaneous modification by different threads.
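A minimal sketch of the threadprivate approach is given below. The names g_shift, shifted_square and run_shifted are hypothetical, and the parallel loop stands in for your own calling code. We assume the code is compiled with OpenMP enabled; without it the pragmas are ignored and the program runs serially with a single copy of the global.

```c
#include <assert.h>
#include <stddef.h>

/* Global read by the user-supplied function. The threadprivate directive
   gives each OpenMP thread its own copy of the variable. */
static double g_shift = 0.0;
#pragma omp threadprivate(g_shift)

/* Hypothetical user-supplied function that depends on the global. */
double shifted_square(double x)
{
    return (x - g_shift) * (x - g_shift);
}

/* Each thread sets its own copy of g_shift before calling the function.
   Returns the number of entries of out equal to the expected value 4. */
int run_shifted(int n, double *out)
{
    assert(n >= 0 && out != NULL);
    int ok = 0;
    #pragma omp parallel for reduction(+:ok)
    for (int i = 0; i < n; i++) {
        g_shift = (double)i;               /* private to this thread */
        out[i] = shifted_square(i + 2.0);  /* (i + 2 - i)^2 = 4 */
        ok += (out[i] == 4.0);
    }
    return ok;
}
```

If g_shift were an ordinary shared global, one thread could overwrite it between another thread's assignment and its call to shifted_square, and some entries of out would differ from 4.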
Thread safety of user-supplied functions is also an issue with a number of functions in multi-threaded implementations of the NAG Library, which may internally parallelize around the calls to the user-supplied functions. This issue affects not just global variables but also how the comm array may be used. In these cases, synchronisation may be needed to ensure thread safety. Chapter X06 provides functions which can be used in your supplied function to determine whether it is being called from within an OpenMP parallel region. If you are in doubt over the thread safety of your program you are advised to contact NAG for assistance.
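The kind of synchronisation that may be needed can be sketched as follows. The example is an assumption-laden illustration, not Library code: demo_ctx plays the role of shared data passed to the supplied function, and demo_sum_of_squares is a hypothetical stand-in for a Library function that parallelizes around its calls to the supplied function. OpenMP is assumed to be enabled at compile time; otherwise the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context shared by all calls to the user-supplied function. */
typedef struct {
    long ncalls;    /* shared counter updated by every call */
} demo_ctx;

/* Hypothetical user-supplied function. */
double counted_square(double x, demo_ctx *ctx)
{
    /* The atomic update prevents lost increments under concurrent calls. */
    #pragma omp atomic
    ctx->ncalls++;
    return x * x;
}

/* Stand-in for a library routine that parallelizes around the callback. */
double demo_sum_of_squares(int n, demo_ctx *ctx)
{
    assert(n >= 0 && ctx != NULL);
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; i++)
        total += counted_square((double)i, ctx);
    return total;
}
```

Without the atomic directive the counter could end up smaller than the number of calls actually made, because two threads may read the same old value before either writes its increment back.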

1.2 Input/Output

When using the error mechanism of the NAG CL Interface in multithreaded applications, we recommend that its output is switched off (by setting fail.print = Nag_FALSE).

1.3 Implementation Issues

In very rare cases we are unable to guarantee the thread safety of a particular implementation. Note also that in some implementations, the Library is linked with one or more vendor libraries to provide, for example, efficient BLAS functions. NAG cannot guarantee that any such vendor library is thread safe. Please consult the Users' Note for your implementation for any additional implementation-specific information.

2 Parallelism

2.1 Introduction

The time taken to execute a function from the NAG Library has traditionally depended, to a large degree, on the serial performance capabilities of the processor being used. In an effort to go beyond the performance limitations of a single core processor, multithreaded implementations of the NAG Library are available. These implementations divide the computational workload of some functions between multiple cores and execute these tasks in parallel. Traditionally, computer systems consisted of a small number of processors each with a single core. Improvements in the performance capabilities of these processors happened in line with increases in clock frequencies. However, this increase reached a limit which meant that processor designers had to find another way in which to improve performance; this led to the development of multicore processors, which are now ubiquitous. Instead of consisting of a single compute core, multicore processors consist of two or more, which typically comprise at least a Central Processing Unit and a small cache. Thus making effective use of parallelism, wherever possible, has become imperative in order to maximize the performance potential of modern hardware resources and of the multithreaded implementations.
The effectiveness of parallelism can be measured by how much faster a parallel program is compared to an equivalent serial program. This is called the parallel speedup. If a serial program has been parallelized then the speedup of the parallel implementation of the program is defined by dividing the time taken by the original serial program on a given problem by the time taken by the parallel program using n cores to compute the same problem. Ideal speedup is obtained when this value is n (i.e., when the parallel program takes 1/n of the time of the original serial program). If speedup of the parallel program is close to ideal for increasing values of n then we say the program has good scalability.
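As a worked illustration, write T_1 for the time taken by the serial program and T_n for the time taken by the parallel program on n cores. These symbols and the timings below are assumptions introduced for the example, not measurements of any Library function.

```latex
S_n = \frac{T_1}{T_n}, \qquad
\text{e.g. } T_1 = 120\,\mathrm{s},\; T_4 = 40\,\mathrm{s}
\;\Rightarrow\; S_4 = \frac{120}{40} = 3 \quad (\text{ideal speedup: } 4).
```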
The scalability of a parallel program may be less than the ideal value because of two factors:
  (a) the overheads introduced as part of the parallel implementation, and
  (b) inherently serial parts of the program.
Overheads include communication and synchronisation as well as any extra setup required to allow parallelism. Such overheads depend on the efficiency of the compiler and operating system libraries and the underlying hardware. The impact on performance of inherently serial fractions of a program is explained theoretically (i.e., assuming an idealised system in which overheads are zero) by Amdahl's law. Amdahl's law places an upper bound on the speedup of a parallel program with a given inherently serial fraction. If r is the parallelizable fraction of a program and s = 1 - r is the inherently serial fraction then the speedup using n sub-tasks, $S_n$, satisfies the following:
$$S_n \le \frac{1}{s + \frac{r}{n}}$$
Thus, for example, this says that a program with a serial fraction of one quarter can only ever achieve a speedup of 4, since as $n \to \infty$, $S_n \to 4$.
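The bound is easy to evaluate for a given serial fraction. The helper below is a minimal sketch; the name amdahl_bound is hypothetical and the function simply evaluates the right-hand side of Amdahl's law, 1/(s + r/n) with r = 1 - s.

```c
#include <assert.h>

/* Upper bound on speedup from Amdahl's law, 1 / (s + r/n),
   where s is the inherently serial fraction and r = 1 - s. */
double amdahl_bound(double s, int n)
{
    assert(s >= 0.0 && s <= 1.0 && n >= 1);
    double r = 1.0 - s;
    return 1.0 / (s + r / n);
}
```

With s = 0.25 the bound is 1 for n = 1 and 2 for n = 3, and it stays below 4 however large n becomes, in line with the example in the text.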
Parallelism may be utilised on two classes of systems: shared memory and distributed memory machines, which require different programming techniques. Distributed memory machines are composed of processors located in multiple components which each have their own memory space and are connected by a network. Communication and synchronisation between these components is explicit. Shared memory machines have multiple processors (or a single multicore processor) which can all access the same memory space, and this shared memory is used for communication and synchronisation. The NAG Library makes use of shared memory parallelism using OpenMP as described in Section 2.2.
Parallel programs which use OpenMP create (or "fork") a number of threads from a single process when required at run-time. (Programs which make use of shared memory parallelism are also called multithreaded programs.) The threads form a team comprising a single master thread and a number of slave threads. These threads are capable of executing program instructions independently of one another in parallel. Once the parallel work has been completed the slave threads return control to the master thread and become inactive (or "join") until the next parallel region of work. The threads share the same memory address space, i.e., that of the parent process, and this shared memory is used for communication and synchronisation. OpenMP provides some mechanisms for access control so that, as well as allowing all threads to access shared variables, it is possible for each thread to have private copies of other variables that only it can access. Threads in a team can create their own parallel regions within the current parallel region. At this next level of parallelism, the thread creating the new team becomes the master thread of that team. We call this nested parallelism.
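The fork-join model and the distinction between shared and private variables can be seen in a small sketch. The function dot is a hypothetical example written for this note, not a Library function, and it assumes OpenMP is enabled at compile time (without it the pragma is ignored and the loop runs serially).

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with one parallel region: threads fork at the pragma, each
   accumulates a private partial sum, and the partial sums are combined
   into the shared result when the threads join at the end of the loop. */
double dot(const double *x, const double *y, int n)
{
    assert(n >= 0 && (n == 0 || (x != NULL && y != NULL)));
    double total = 0.0;   /* shared result; the reduction clause gives
                             each thread its own private copy */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i] * y[i];
    return total;
}
```

The arrays x and y are shared and only read, the loop index i is private to each thread, and the reduction clause handles the one variable that every thread needs to update.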
Something to be aware of for multithreaded programs, compared to serial ones, is that identical results cannot be guaranteed, nor should be expected. Identical results are often impossible in a parallel program since using different numbers of threads may cause floating-point arithmetic to be evaluated in a different (but equally valid) order, thus changing the accumulation of rounding errors. For a more in-depth discussion of reproducibility of results see Section 8 in How to Use the NAG Library.
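The effect of evaluation order can be demonstrated without any threads at all. The sketch below sums the same three values in two different orders; the function names are hypothetical and the input values are chosen purely to make the rounding visible in IEEE double precision.

```c
#include <assert.h>

/* Accumulate from the first element to the last. */
double sum_forward(const double *x, int n)
{
    assert(n >= 0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += x[i];
    return total;
}

/* Accumulate the same values from the last element to the first. */
double sum_backward(const double *x, int n)
{
    assert(n >= 0);
    double total = 0.0;
    for (int i = n - 1; i >= 0; i--)
        total += x[i];
    return total;
}
```

For the array {1e20, -1e20, 1.0} the forward sum cancels the two large values first and returns 1, whereas the backward sum absorbs the 1 into -1e20 through rounding and returns 0. A parallel reduction changes the grouping of terms in a similar way when the number of threads changes.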

2.2 How is Parallelism Used in the NAG Library?

The multithreaded implementations differ from the serial implementations of the NAG Library in that they make use of multithreading through OpenMP, which is a portable specification for shared memory programming that is available in many different compilers on a wide range of different hardware platforms (see The OpenMP API Specification for Parallel Programming).
Note that not all functions are parallelized; you should check Section 8 of the function documents to find details about parallelism and performance of functions of interest.
There are two situations in which a call to a function in the NAG Library makes use of multithreading:
  1. The function being called is a NAG-specific function that has been threaded using OpenMP, or that internally calls another NAG-specific function that is threaded. This applies to multithreaded implementations of the NAG Library only.
  2. The function being called calls through to BLAS or LAPACK functions. The vendor library recommended for use with your implementation of the NAG Library (whether the NAG Library is threaded or not) may be threaded. Please consult the Users' Note for further information.
A complete list of all the functions in the NAG Library, and their threaded status is given in Section 3.
It is useful to understand how OpenMP is used within the Library in order to avoid the potential pitfalls which lead to making inefficient use of the Library.
A call to a threaded NAG-specific function may, depending on input and at one or more points during execution, use OpenMP to create a team of threads for a parallel region of work. The team of threads will fork at the start of the parallel region before joining at the end of the parallel region. Both the fork and the join will happen internally within the function call. However, there are situations in which the teams of threads may be made available to OpenMP directives in your code via user-supplied subprograms; we refer to directives not contained within a parallel region as orphaned directives. (See Section 8 of the function documents for further information.) Furthermore, OpenMP constructs within NAG functions are executed by teams of threads created within the NAG code, that is, there are no orphaned directives in the Library itself.
Throughout this documentation we assume the use of the recommended compiler as given in the Users' Note, and in particular the use of a single OpenMP run-time library. Thus all OpenMP environment variables will apply to your own code and to NAG functions. However, they may not be respected by vendor libraries that have a mechanism for overriding them. NAG provides functions in Chapter X06 to control threads for your whole program, including any specific to a vendor library being called by NAG.
You should take care when calling threaded NAG functions from within your own parallel regions, since if nested parallelism is enabled (it is disabled by default) the NAG function will fork-and-join a team of threads for each calling thread, which may lead to contention on system resources and very poor performance. Poor performance due to contention can also occur if the number of threads requested exceeds the number of physical cores in your machine, or if some hardware resources are busy executing other processes (which may belong to other users in a shared system).
For these reasons you should be aware of the number of physical cores available to your program on your machine, and use this information in selecting a number of threads which minimizes contention on resources. Please read the Users' Note for advice about setting the number of threads to use, or contact the NAG Technical Support Service for advice.
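The arithmetic behind oversubscription with nested parallelism can be sketched as below. Both helpers are hypothetical and implement one simple heuristic only; they are not NAG's recommendation, and the advice in the Users' Note for your implementation takes precedence.

```c
#include <assert.h>

/* Threads that may be active when every outer thread calls a routine that
   forks its own inner team (nested parallelism enabled). */
int total_threads(int outer, int inner)
{
    assert(outer >= 1 && inner >= 1);
    return outer * inner;
}

/* Largest inner team size that keeps outer * inner within the number of
   physical cores; never less than 1. */
int inner_team_size(int outer, int cores)
{
    assert(outer >= 1 && cores >= 1);
    int inner = cores / outer;
    return inner >= 1 ? inner : 1;
}
```

For example, 4 outer threads that each fork a team of 8 give 32 active threads. On a machine with 16 physical cores, this heuristic would limit each inner team to 4 threads.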
If you are calling multithreaded NAG functions from within another threading mechanism you need to be aware of whether or not this threading mechanism is compatible with the OpenMP compiler runtime used to build the multithreaded implementation of the NAG Library on your platform(s) of choice. The Users' Note document for each of the implementations in question will include some guidance on this, and you should contact NAG for further advice if required.
Parallelism is used in many places throughout the NAG Library since, although many functions have not been the focus of parallel development by NAG, they may benefit by calling functions that have, and/or by calling parallel vendor functions (e.g., BLAS, LAPACK). Thus, the performance improvement due to multithreading, if any, will vary depending upon which function is called, problem sizes and other parameters, system design and operating system configuration. If you frequently call a function with similar data sizes and other parameters, it may be worthwhile to experiment with different numbers of threads to determine the choice that gives optimal performance. Please contact NAG for further advice if required.
As a general guide, many key functions in the following areas are known to benefit from shared memory parallelism:

3 Multithreaded Functions

Many functions are threaded using OpenMP in multithreaded implementations of the NAG Library. These implementations are denoted by having a product code of the form 'NS_______', rather than 'NL_______' for serial NAG Library implementations. Please consult Section 8 of each function document for further information.
The documentation search facility may be used to return lists of those functions that have been threaded using OpenMP in NAG code in multithreaded NAG Library implementations or via vendor BLAS or LAPACK implementations. You may use the keywords smp, nagsmp or lapacksmp optionally combined with the keywords =fl or =cl. For example search.html?q=smp gives a full list, or search.html?q=nagsmp+=cl lists the functions available via the CL interface which use OpenMP in NAG code in suitable implementations.
The lists returned include functions which internally call BLAS or LAPACK routines, which may be threaded within the vendor library used by both serial and multithreaded NAG Library implementations. You are advised to consult the documentation for the vendor library for further information. Please consult the Users' Note for your implementation for any additional implementation-specific information.

4 References

The OpenMP API Specification for Parallel Programming