Thread-local arrays in cython's prange without huge memory allocation
I have some independent computations I would like to do in parallel using Cython.
Right now I'm using this approach:
import numpy as np
cimport numpy as cnp
from cython.parallel import prange
[...]
cdef cnp.ndarray[cnp.float64_t, ndim=2] temporary_variable = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
cdef cnp.ndarray[cnp.float64_t, ndim=2] result = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
for i in prange(INPUT_SIZE, nogil=True):
    for j in range(RESULT_SIZE):
        [...]
        temporary_variable[i, j] = some_very_heavy_mathematics(my_input_array)
        result[i, j] = some_more_maths(temporary_variable[i, j])
This approach works, but my problem is that I actually need several temporary_variables. This results in huge memory usage when INPUT_SIZE grows. I believe what is really needed is one temporary variable per thread instead.
Am I facing a limitation of Cython's prange and do I need to learn proper C or am I doing/understanding something terribly wrong?
EDIT: The functions I was looking for were openmp.omp_get_max_threads() and openmp.omp_get_thread_num() to create a reasonably sized temporary array. I had to cimport openmp first.
@DavidW Thanks for your help. I should probably split my code into smaller functions because I need arrays. I'm struggling to figure out how to do so unfortunately.
– nicoco
Aug 31 at 8:19
I'll try to write a more complete answer in the next few days, but my suggestion was that if the two lines shown (temp_var = ... and some_more_maths(temp_var)) are contained in a function, then the variable is local to the function (so definitely thread-local).
– DavidW
Aug 31 at 18:09
1 Answer
This is something that Cython tries to detect, and actually gets right most of the time. If we take a more complete example code:
import numpy as np
from cython.parallel import prange
cdef double f1(double[:,:] x, int i, int j) nogil:
    return 2*x[i,j]

cdef double f2(double y) nogil:
    return y+10

def example_function(double[:,:] arr_in):
    cdef double[:,:] result = np.zeros(arr_in.shape)
    cdef double temporary_variable
    cdef int i,j
    for i in prange(arr_in.shape[0], nogil=True):
        for j in range(arr_in.shape[1]):
            temporary_variable = f1(arr_in,i,j)
            result[i,j] = f2(temporary_variable)
    return result
(this is basically the same as yours, but compilable). This compiles to the C code:
#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_j) lastprivate(__pyx_v_temporary_variable)
#endif /* _OPENMP */
for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
You can see that temporary_variable is set to be thread-local. If Cython does not detect this correctly (I find it's often too keen to make variables a reduction) then my suggestion is to encapsulate (some of) the contents of the loop in a function:
cdef double loop_contents(double[:,:] arr_in, int i, int j) nogil:
    cdef double temporary_variable
    temporary_variable = f1(arr_in,i,j)
    return f2(temporary_variable)
Doing so forces temporary_variable to be local to the function (and hence to the thread).
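For completeness, here is a sketch of how the parallel loop might call such a helper. It assumes f1 and f2 as defined earlier in this answer (repeated so the block compiles on its own); the name example_function_wrapped is made up for illustration. The module has to be built with OpenMP enabled (for example -fopenmp with GCC) for prange to actually run in parallel.

```cython
import numpy as np
from cython.parallel import prange

cdef double f1(double[:, :] x, Py_ssize_t i, Py_ssize_t j) nogil:
    return 2 * x[i, j]

cdef double f2(double y) nogil:
    return y + 10

cdef double loop_contents(double[:, :] arr_in, Py_ssize_t i, Py_ssize_t j) nogil:
    # A plain function local: every call, and so every thread, has its own copy.
    cdef double temporary_variable = f1(arr_in, i, j)
    return f2(temporary_variable)

def example_function_wrapped(double[:, :] arr_in):
    # Hypothetical name; same computation as example_function above.
    cdef Py_ssize_t n_rows = arr_in.shape[0]
    cdef Py_ssize_t n_cols = arr_in.shape[1]
    cdef double[:, :] result = np.zeros((n_rows, n_cols))
    cdef Py_ssize_t i, j
    for i in prange(n_rows, nogil=True):
        for j in range(n_cols):
            # No temporary is visible at this level, so there is nothing
            # for Cython to misclassify as shared or as a reduction.
            result[i, j] = loop_contents(arr_in, i, j)
    return np.asarray(result)
```

The loop body now only writes to result[i, j], which is indexed by the loop variables and therefore never touched by two threads at once.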
With respect to creating a thread-local array: I'm not 100% clear exactly what you want to do but I'll try to take a guess...
The easiest way is to allocate a 2D array where you have one column for each thread. The array is shared, but since each thread only touches its own column that doesn't matter. A simple example:
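import numpy as np
cimport openmp
from cython.parallel import prange

cdef double[:] f1(double[:,:] x, int i) nogil:
    return x[i,:]

def example_function(double[:,:] arr_in):
    # one column of scratch space per thread, not per input row
    cdef double[:,:] temporary_variable = np.zeros((arr_in.shape[1],openmp.omp_get_max_threads()))
    cdef int i
    for i in prange(arr_in.shape[0],nogil=True):
        # each thread writes only to its own column
        temporary_variable[:,openmp.omp_get_thread_num()] = f1(arr_in,i)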
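The comments below also mention a lower-level route that needs some C. A minimal sketch of that idea, assuming it means a per-thread scratch buffer allocated with malloc and released with free inside a parallel() block (the pattern used in Cython's parallelism documentation). The function name row_sums_of_squares and the computation itself are invented purely for illustration.

```cython
import numpy as np
from cython.parallel import parallel, prange
from libc.stdlib cimport malloc, free, abort

def row_sums_of_squares(double[:, :] arr_in):
    # Hypothetical example: result[i] = sum over j of arr_in[i, j] ** 2,
    # staged through a scratch buffer owned by the current thread.
    cdef Py_ssize_t n_rows = arr_in.shape[0]
    cdef Py_ssize_t n_cols = arr_in.shape[1]
    cdef double[:] result = np.zeros(n_rows)
    cdef Py_ssize_t i, j
    cdef double total
    cdef double *scratch

    if n_cols == 0:
        # Avoid malloc(0), which is allowed to return NULL.
        return np.asarray(result)

    with nogil, parallel():
        # Assigned inside the parallel block, so each thread gets its own pointer.
        scratch = <double *> malloc(sizeof(double) * n_cols)
        if scratch == NULL:
            abort()
        for i in prange(n_rows):
            for j in range(n_cols):
                scratch[j] = arr_in[i, j] * arr_in[i, j]
            # Plain assignment (not +=) keeps total private rather than a reduction.
            total = 0
            for j in range(n_cols):
                total = total + scratch[j]
            result[i] = total
        free(scratch)
    return np.asarray(result)
```

Scratch memory here is n_cols doubles per thread, independent of the number of rows, which is the same footprint as the one-column-per-thread array but without sharing a single allocation.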
Thanks a lot for your detailed answer. However, I still don't understand how to make temporary_variable a thread-local array (see the edit on my post). Maybe this is not something that can be done in Cython and I need to refactor my code to avoid needing thread-local arrays.
– nicoco
Sep 3 at 8:40
I think the edit should be what you want.
– DavidW
Sep 3 at 17:59
Thanks again. #3 is what I was already doing; the problem is that it requires a huge amount of RAM for large inputs. I guess #2 is what I need to do, but I need to improve my C skills first. Right now, I just gave up on parallelism for this specific case and well, it gives me an excuse to hang out on SO while waiting for my results. :o)
– nicoco
Sep 4 at 8:01
It isn't the same as what you show in the question. You create an array that's input_size x result_size. I create an array that's input_size x number_of_threads. number_of_threads is typically reasonably small (4 or 8?).
– DavidW
Sep 4 at 8:28
Oh sorry I missed that. I think that is exactly what I was looking for. I'll try it ASAP. Thank you very much.
– nicoco
Sep 4 at 11:09
Cython generally assigns thread-locals correctly (if you just make it a scalar rather than an array). Failing that, see if you can put the loop body in a separate function with its own local variables.
– DavidW
Aug 30 at 21:24