NAG FL Interface
Multithreading
1 Thread Safety
In multithreaded applications, each thread in a team processes instructions independently while sharing the same memory address space. For these applications to operate correctly, any routines called from them must be thread safe: that is, any global variables they contain must not be accessed simultaneously by different threads, as unsynchronised concurrent access can corrupt results. This can be ensured through appropriate synchronisation, such as that provided by OpenMP.
When a routine is described as thread safe we are considering its behaviour when it is called by multiple threads. It is worth noting that a thread-unsafe routine can still, itself, be multithreaded: a team of threads can be created inside the routine to share the workload, as described in
Section 2.
Most routines in the NAG FL Interface are thread safe; however, there remain some older routines, listed in the document
Thread Unsafe Routines, that are not thread safe because they use unsynchronised global variables (such as module variables, common blocks or variables with the SAVE attribute). These routines should not be called by multiple threads in a user program. Please consult Section 8 of each routine document for further information.
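As an illustration of the hazard (a hypothetical sketch, not NAG code), the following module variable is updated by multiple threads; the first subroutine is thread unsafe, while the second uses an OpenMP critical construct to synchronise access:

```fortran
! Hypothetical illustration only: not part of the NAG Library.
Module counter_mod
   Private
   Integer, Public :: ncalls = 0   ! global state shared by all threads
Contains
   Subroutine bump_unsafe()
      ! Thread unsafe: two threads may read and write ncalls at the
      ! same time, so increments can be lost
      ncalls = ncalls + 1
   End Subroutine bump_unsafe
   Subroutine bump_safe()
      ! Thread safe: the critical construct serialises access
!$Omp Critical (ncalls_update)
      ncalls = ncalls + 1
!$Omp End Critical (ncalls_update)
   End Subroutine bump_safe
End Module counter_mod
```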
In the NAG FL Interface there are some pairs of routines which share the same five-character root name, for example, the routines
e04ucf/e04uca. Each routine in the pair has exactly the same functionality, except that one of them has additional parameters in order to make it thread safe. The thread safe routine is distinguished by the last character of its name (typically 'a' in place of the usual 'f'). Such pairs are documented in a single routine document and are listed in the individual Chapter Contents.
1.1 Routines with Routine Arguments
Some Library routines require you to supply a routine and to pass the name of the routine as an actual argument in the call to the Library routine. For many of these Library routines, the supplied routine interface includes
array arguments (called iuser and ruser)
specifically for you to pass information to the supplied routine without the need for global variables.
In the NAG FL Interface, if the interfaces of a pair of thread safe (ending ‘a’) and non-thread safe (ending ‘f’) routines contain a user-supplied routine argument then the ‘a’ routine will contain the additional array arguments iuser and ruser (possibly plus others for internal use). In some cases the ‘a’ routine may need to be initialized by a separate initialization routine; this requirement will be clearly documented.
From Mark 26.1, newly added routines with routine arguments also contain the argument
cpuser, which is of Type (c_ptr) (from the iso_c_binding module). This allows more complicated data structures to be passed easily to the user-supplied routine in cases where
iuser and
ruser would be inconvenient. The following code fragment shows how this can be used.
Module mymodule
   Use iso_c_binding, Only: c_f_pointer, c_ptr
   Use nag_library, Only: nag_wp
   Private
   Public :: myfun
   Type, Public :: mydata
      Integer :: nx
      Real (Kind=nag_wp), Allocatable :: x(:)
   End Type mydata
Contains
   Subroutine myfun(...,iuser,ruser,cpuser)
      Type (c_ptr), Intent (In) :: cpuser
      Type (mydata), Pointer :: md
      ! Recover a Fortran pointer to the caller's data from the C pointer
      Call c_f_pointer(cpuser,md)
      ... Use md%x and md%nx ...
   End Subroutine myfun
End Module mymodule
...
Program myprog
   Use iso_c_binding, Only: c_loc, c_ptr
   Use mymodule, Only: mydata, myfun
   Type (c_ptr) :: cpuser
   Type (mydata), Target :: md
   ...
   md%nx = 1000
   Allocate (md%x(md%nx))
   ! Pass the address of md through the c_ptr argument
   cpuser = c_loc(md)
   ...
   Call nagroutine(...,myfun,cpuser,iuser,ruser,ifail)
   ...
End Program myprog
This mechanism is used, for example, in
Section 10 in
e04stf.
If you need to provide your supplied routine with more information than can be given via the interface argument list, then you are advised to check, in the relevant Chapter Introduction, whether the Library routine you intend to call has an equivalent reverse communication interface. These have been designed specifically for problems where user-supplied routine interfaces are not flexible enough for a given problem, and their use should eliminate the need to provide data through global variables. Where reverse communication interfaces are not available, it is usual to use global variables containing the required data that is accessible from both the supplied routine and from the calling program. It is thread safe to do this only if any global data referenced is made threadprivate by OpenMP or is updated using appropriate synchronisation, thus avoiding the possibility of simultaneous modification by different threads.
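Where global data must be used, it can be made threadprivate so that each thread works on its own copy; the following is a sketch only, with illustrative module and variable names:

```fortran
! Illustrative sketch: each thread sees a private copy of coeffs.
Module userdata_mod
   Use nag_library, Only: nag_wp
   Private
   Real (Kind=nag_wp), Allocatable, Public :: coeffs(:)
!$Omp Threadprivate (coeffs)
End Module userdata_mod
```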
Thread safety of user-supplied routines is also an issue with a number of routines in multi-threaded implementations of the NAG Library, which may internally parallelize around the calls to the user-supplied routines. This issue affects not just global variables but also how the
iuser and
ruser arrays, and any data structures pointed to by
cpuser,
may be used. In these cases, synchronisation may be needed to ensure thread safety.
Chapter X06 provides routines which can be used in your supplied routine to determine whether it is being called from within an OpenMP parallel region. If you are in doubt over the thread safety of your program you are advised to contact
NAG for assistance.
1.2 Input/Output
The Library contains routines for setting the current error and advisory message unit numbers (
x04aaf and
x04abf). These routines use the SAVE statement to retain the values of the current unit numbers between calls. It is therefore not advisable for different threads of a multithreaded program to set the message unit numbers to different values. A consequence of this is that error or advisory messages output simultaneously may become garbled, and in any event there is no indication of which thread produces which message. You are therefore advised always to select the 'soft failure' mechanism without any error message (ifail = 1, see
Section 4 in the Introduction to the NAG Library FL Interface) on entry to each NAG Library routine called from a multithreaded application; it is then essential that the value of
ifail be tested on return to the application.
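In practice this means setting ifail to 1 (quiet soft failure) before each call and testing it afterwards; in this sketch nagroutine stands for any NAG Library routine:

```fortran
ifail = 1                        ! soft failure, no error message printed
Call nagroutine(...,ifail)
If (ifail/=0) Then
   ! handle or record the failure in a thread safe way
End If
```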
1.3 Implementation Issues
In very rare cases we are unable to guarantee the thread safety of a particular specific implementation. Note also that in some implementations, the Library is linked with one or more vendor libraries to provide, for example, efficient BLAS functions. NAG cannot guarantee that any such vendor library is thread safe. Please consult the
Users' Note for your implementation for any additional implementation-specific information.
2 Parallelism
2.1 Introduction
The time taken to execute a routine from the NAG Library has traditionally depended, to a large degree, on the serial performance capabilities of the processor being used. In an effort to go beyond the performance limitations of a single core processor, multithreaded implementations of the NAG Library are available. These implementations divide the computational workload of some routines between multiple cores and execute these tasks in parallel. Traditionally, such systems consisted of a small number of processors, each with a single core. Improvements in the performance capabilities of these processors happened in line with increases in clock frequencies. However, this increase reached a limit, which meant that processor designers had to find another way to improve performance; this led to the development of multicore processors, which are now ubiquitous. Instead of consisting of a single compute core, multicore processors consist of two or more cores, each typically comprising at least a Central Processing Unit and a small cache. Making effective use of parallelism, wherever possible, has therefore become imperative in order to realise the performance potential of modern hardware, and this is what the multithreaded implementations of the NAG Library aim to do.
The effectiveness of parallelism can be measured by how much faster a parallel program is than an equivalent serial program: this is called the parallel speedup. If a serial program has been parallelized, then the speedup S(p) of the parallel implementation is defined by dividing the time taken by the original serial program on a given problem by the time taken by the parallel program using p cores to compute the same problem. Ideal speedup is obtained when S(p) = p (i.e., when the parallel program takes 1/p-th of the time of the original serial program). If the speedup of the parallel program is close to ideal for increasing values of p, then we say the program has good scalability.
The scalability of a parallel program may be less than the ideal value because of two factors:
(a) the overheads introduced as part of the parallel implementation, and
(b) inherently serial parts of the program.
Overheads include communication and synchronisation as well as any extra setup required to allow parallelism. Such overheads depend on the efficiency of the compiler and operating system libraries and the underlying hardware. The impact on performance of inherently serial fractions of a program is explained theoretically (i.e., assuming an idealised system in which overheads are zero) by
Amdahl's law. Amdahl's law places an upper bound on the speedup of a parallel program with a given inherently serial fraction. If f_p is the parallelizable fraction of a program and f_s = 1 - f_p is the inherently serial fraction, then the speedup S(n) using n sub-tasks satisfies the following:

S(n) <= 1 / (f_s + f_p/n)

Thus, for example, this says that a program with a serial fraction of f_s = 1/4 can only ever achieve a speedup of 4, since S(n) -> 1/f_s = 4 as n -> infinity.
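The bound is easy to evaluate numerically; the short program below (illustrative, not NAG code) tabulates it for a serial fraction of one quarter:

```fortran
Program amdahl
   Implicit None
   Real :: fs, fp
   Integer :: n
   fs = 0.25              ! inherently serial fraction
   fp = 1.0 - fs          ! parallelizable fraction
   Do n = 1, 64
      ! e.g. n = 4 gives 2.29, n = 64 gives 3.82: approaching the limit 4
      Write (*,*) n, 1.0/(fs+fp/Real(n))
   End Do
End Program amdahl
```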
Parallelism may be utilised on two classes of systems: shared memory and distributed memory machines, which require different programming techniques. Distributed memory machines are composed of processors located in multiple components which each have their own memory space and are connected by a network. Communication and synchronisation between these components is explicit. Shared memory machines have multiple processors (or a single multicore processor) which can all access the same memory space, and this shared memory is used for communication and synchronisation. The NAG Library makes use of shared memory parallelism using OpenMP as described in
Section 2.2.
Parallel programs which use OpenMP create (or "fork") a number of threads from a single process when required at run-time. (Programs which make use of shared memory parallelism are also called multithreaded programs.) The threads form a team comprising a single master thread and a number of slave threads. These threads are capable of executing program instructions independently of one another in parallel. Once the parallel work has been completed the slave threads return control to the master thread and become inactive (or "join") until the next parallel region of work. The threads share the same memory address space, i.e., that of the parent process, and this shared memory is used for communication and synchronisation. OpenMP provides some mechanisms for access control so that, as well as allowing all threads to access shared variables, it is possible for each thread to have private copies of other variables that only it can access. Threads in a team can create their own parallel regions within the current parallel region. At this next level of parallelism, the thread creating the new team becomes the master thread of that team. We call this nested parallelism.
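A minimal fork/join can be sketched as follows (illustrative only; requires an OpenMP-enabled compiler):

```fortran
Program forkjoin
   Use omp_lib, Only: omp_get_num_threads, omp_get_thread_num
   Implicit None
   Integer :: tid
   ! Serial part: only the master thread executes here
!$Omp Parallel Private (tid)
   ! The team has forked: every thread executes this block independently
   tid = omp_get_thread_num()
   Write (*,*) 'Hello from thread', tid, 'of', omp_get_num_threads()
!$Omp End Parallel
   ! The slave threads have joined: execution is serial again
End Program forkjoin
```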
Something to be aware of for multithreaded programs, compared to serial ones, is that identical results cannot be guaranteed, nor should they be expected. Identical results are often impossible in a parallel program since using different numbers of threads may cause floating-point arithmetic to be evaluated in a different (but equally valid) order, thus changing the accumulation of rounding errors. For a more in-depth discussion of reproducibility of results see
Section 8 in How to Use the NAG Library.
2.2 How is Parallelism Used in the NAG Library?
The multithreaded implementations differ from the serial implementations of the NAG Library in that they make use of multithreading through OpenMP, a portable specification for shared memory programming that is available in many different compilers on a wide range of hardware platforms (see
The OpenMP API Specification for Parallel Programming).
Note that not all routines are parallelized; you should check Section 8 of the routine documents to find details about parallelism and performance of routines of interest.
There are two situations in which a call to a routine in the NAG Library makes use of multithreading:
1. The routine being called is a NAG-specific routine that has been threaded using OpenMP, or that internally calls another NAG-specific routine that is threaded. This applies to multithreaded implementations of the NAG Library only.
2. The routine being called calls through to BLAS or LAPACK routines. The vendor library recommended for use with your implementation of the NAG Library (whether the NAG Library is threaded or not) may be threaded. Please consult the Users' Note for further information.
A complete list of all the routines in the NAG Library and their threaded status is given in
Section 3.
It is useful to understand how OpenMP is used within the Library in order to avoid the potential pitfalls which lead to making inefficient use of the Library.
A call to a threaded NAG-specific routine may, depending on input and at one or more points during execution, use OpenMP to create a team of threads for a parallel region of work. The team of threads will fork at the start of the parallel region and join at the end of it; both the fork and the join happen internally within the routine call. However, there are situations in which the teams of threads may be made available to OpenMP directives in your code via user-supplied subprograms; we refer to directives not lexically contained within a parallel region as
orphaned directives. (See Section 8 of the routine documents for further information.) OpenMP constructs within NAG routines themselves are always executed by teams of threads created within the NAG code; that is, there are no orphaned directives in the Library itself. Throughout this documentation we assume the use of the recommended compiler as given in the
Users' Note, and in particular the use of a single OpenMP run-time library. Thus all OpenMP environment variables will apply to your own code and to NAG routines. However, they may not be respected by vendor libraries that have a mechanism for overriding them. NAG provides routines in
Chapter X06 to control threads for your
whole program, including any specific to a vendor library being called by NAG. You should take care when calling these NAG routines from within your own parallel regions, since if nested parallelism is enabled (it is disabled by default) the NAG routine will fork-and-join a team of threads for each calling thread, which may lead to contention on system resources and very poor performance. Poor performance due to contention can also occur if the number of threads requested exceeds the number of physical cores in your machine, or if some hardware resources are busy executing other processes (which may belong to other users in a shared system). For these reasons you should be aware of the number of physical cores available to your program on your machine, and use this information in selecting a number of threads which minimizes contention on resources. Please read the
Users' Note for advice about setting the number of threads to use, or contact the
NAG Technical Support Service
for advice.
If you are calling multithreaded NAG routines from within another threading mechanism you need to be aware of whether or not this threading mechanism is compatible with the OpenMP compiler runtime used to build the multithreaded implementation of the NAG Library on your platform(s) of choice. The
Users' Note document for each of the implementations in question will include some guidance on this, and you should contact
NAG for further advice if required.
Parallelism is used in many places throughout the NAG Library since, although many routines have not been the focus of parallel development by NAG, they may benefit by calling routines that have, and/or by calling parallel vendor routines (e.g., BLAS, LAPACK). Thus, the performance improvement due to multithreading, if any, will vary depending upon which routine is called, problem sizes and other parameters, system design and operating system configuration. If you frequently call a routine with similar data sizes and other parameters, it may be worthwhile to experiment with different numbers of threads to determine the choice that gives optimal performance. Please contact
NAG for further advice if required.
As a general guide, many key routines in the following areas are known to benefit from shared memory parallelism:
- Dense and Sparse Linear Algebra
- FFTs
- Random Number Generators
- Quadrature
- Partial Differential Equations
- Interpolation
- Curve and Surface Fitting
- Correlation and Regression Analysis
- Multivariate Methods
- Time Series Analysis
- Financial Option Pricing
- Global Optimization
- Wavelets
3 Multithreaded Routines
Many routines are threaded using OpenMP in multithreaded implementations of the NAG Library.
These implementations are denoted by having a product code of the form 'NS_______', rather than 'NL_______' for serial NAG Library implementations.
Please consult Section 8 of each
routine document for further information.
The documentation search facility may be used to return lists of those routines that have been threaded using OpenMP in NAG code in multithreaded NAG Library implementations or via vendor BLAS or LAPACK implementations. You may use the keywords
smp,
nagsmp or
lapacksmp optionally combined with the keywords
=fl or
=cl.
For example
search.html?q=smp gives a full list, or
search.html?q=nagsmp+=fl lists the routines available via the FL interface
which use OpenMP in NAG code in suitable implementations.
The lists returned include routines which internally call BLAS or LAPACK routines, which may be threaded within the vendor library used by both serial and multithreaded NAG Library implementations. You are advised to consult the documentation for the vendor library for further information. Please consult the
Users' Note for your implementation for any additional implementation-specific information.
4 References