Thread-local arrays in Cython's prange without huge memory allocation




I have some independent computations I would like to do in parallel using Cython.



Right now I'm using this approach:


import numpy as np
cimport numpy as cnp
from cython.parallel import prange

[...]

cdef cnp.ndarray[cnp.float64_t, ndim=2] temporary_variable = np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
cdef cnp.ndarray[cnp.float64_t, ndim=2] result = np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)

for i in prange(INPUT_SIZE, nogil=True):
    for j in range(RESULT_SIZE):
        [...]
        temporary_variable[i, j] = some_very_heavy_mathematics(my_input_array)
        result[i, j] = some_more_maths(temporary_variable[i, j])



This approach works, but I actually need several arrays like temporary_variable, which results in huge memory usage as INPUT_SIZE grows. I believe what is really needed is one temporary variable per thread instead.





Am I facing a limitation of Cython's prange, so that I need to learn proper C, or am I doing or understanding something terribly wrong?



EDIT: The functions I was looking for were openmp.omp_get_max_threads() and openmp.omp_get_thread_num() to create a reasonably sized temporary array. I had to cimport openmp first.







Cython generally assigns thread locals correctly (if you just make it a scalar rather than an array). Failing that, see if you can put the loop body in a separate function with its own local variables.
– DavidW
Aug 30 at 21:24





@DavidW Thanks for your help. I should probably split my code into smaller functions because I need arrays. I'm struggling to figure out how to do so unfortunately.
– nicoco
Aug 31 at 8:19





I'll try to write a more complete answer in the next few days, but my suggestion was that if the two lines shown (temp_var = ... and some_more_maths(temp_var)) are contained in a function, then the variable is local to the function (so definitely thread-local).
– DavidW
Aug 31 at 18:09







1 Answer



This is something that Cython tries to detect, and it actually gets it right most of the time. Take a more complete example:


import numpy as np
from cython.parallel import prange

cdef double f1(double[:,:] x, int i, int j) nogil:
    return 2*x[i,j]

cdef double f2(double y) nogil:
    return y+10

def example_function(double[:,:] arr_in):
    # Build the shape explicitly rather than passing the memoryview's shape attribute
    cdef double[:,:] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
    cdef double temporary_variable
    cdef int i,j
    for i in prange(arr_in.shape[0], nogil=True):
        for j in range(arr_in.shape[1]):
            temporary_variable = f1(arr_in,i,j)
            result[i,j] = f2(temporary_variable)
    return result



(This is basically the same as yours, but compilable.) It compiles to the following C code:


#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_j) lastprivate(__pyx_v_temporary_variable)
#endif /* _OPENMP */
for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){



You can see that temporary_variable appears in a lastprivate clause, so each thread gets its own copy. If Cython does not detect this correctly (I find it's often too keen to make variables a reduction), then my suggestion is to encapsulate (some of) the contents of the loop in a function:




cdef double loop_contents(double[:,:] arr_in, int i, int j) nogil:
    cdef double temporary_variable
    temporary_variable = f1(arr_in,i,j)
    return f2(temporary_variable)



Doing so forces temporary_variable to be local to the function (and hence to the thread).
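For completeness, here is a minimal sketch of how the loop from the earlier example might call such a helper. The name example_function_v2 is made up for illustration, f1 and f2 are the same toy functions as before, and it assumes the module is compiled with OpenMP enabled (for example -fopenmp with GCC); without that flag prange simply runs serially.

```cython
import numpy as np
from cython.parallel import prange

cdef double f1(double[:, :] x, int i, int j) nogil:
    return 2 * x[i, j]

cdef double f2(double y) nogil:
    return y + 10

cdef double loop_contents(double[:, :] arr_in, int i, int j) nogil:
    # A plain C local: every call, and therefore every thread, has its own copy.
    cdef double temporary_variable = f1(arr_in, i, j)
    return f2(temporary_variable)

def example_function_v2(double[:, :] arr_in):  # hypothetical name
    cdef double[:, :] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
    cdef int i, j
    for i in prange(arr_in.shape[0], nogil=True):
        for j in range(arr_in.shape[1]):
            # No temporary is assigned in the loop body itself, so Cython
            # has nothing to classify as private or as a reduction here.
            result[i, j] = loop_contents(arr_in, i, j)
    return np.asarray(result)
```

Each output element is 2*x + 10 for the matching input element, exactly as in the version with the temporary inside the loop.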





With respect to creating a thread-local array: I'm not 100% clear exactly what you want to do but I'll try to take a guess...





The easiest way is to allocate a 2D array where you have one column for each thread. The array is shared, but since each thread only touches its own column that doesn't matter. A simple example:


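import numpy as np
cimport openmp
from cython.parallel import prange

cdef double[:] f1(double[:,:] x, int i) nogil:
    return x[i,:]

def example_function(double[:,:] arr_in):
    # One column of scratch space per thread, so memory scales with the thread count
    cdef double[:,:] temporary_variable = np.zeros((arr_in.shape[1],openmp.omp_get_max_threads()))
    cdef int i
    for i in prange(arr_in.shape[0],nogil=True):
        # Each thread writes only to the column indexed by its own thread number
        temporary_variable[:,openmp.omp_get_thread_num()] = f1(arr_in,i)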
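If you would rather not share one array between threads at all, another option is a raw C buffer allocated inside a parallel() block, which is the pattern shown in Cython's own parallelism documentation. What follows is only a sketch under assumptions: row_sums_of_squares and sum_of_squares are hypothetical names, and the sum of squares is a stand-in for whatever heavy computation needs the scratch space. Anything assigned inside the parallel() block is private to the thread, so each thread gets its own buffer of n_cols doubles.

```cython
import numpy as np
from cython.parallel import parallel, prange
from libc.stdlib cimport malloc, free, abort

cdef double sum_of_squares(double[:, :] x, Py_ssize_t i, double *scratch) nogil:
    # Stand-in for the real work: fill the scratch buffer, then reduce it.
    cdef Py_ssize_t j
    cdef double total = 0
    for j in range(x.shape[1]):
        scratch[j] = x[i, j] * x[i, j]
    for j in range(x.shape[1]):
        total += scratch[j]
    return total

def row_sums_of_squares(double[:, :] arr_in):  # hypothetical example
    cdef Py_ssize_t n_rows = arr_in.shape[0]
    cdef Py_ssize_t n_cols = arr_in.shape[1]
    cdef double[:] result = np.zeros(n_rows)
    cdef double *scratch
    cdef Py_ssize_t i

    with nogil, parallel():
        # Runs once per thread, not once per iteration.
        scratch = <double *> malloc(sizeof(double) * n_cols)
        if scratch == NULL:
            abort()
        for i in prange(n_rows):
            result[i] = sum_of_squares(arr_in, i, scratch)
        # Each thread releases the buffer it allocated.
        free(scratch)
    return np.asarray(result)
```

Total scratch memory is then n_cols doubles per thread, independent of the number of rows. The price is manual memory management, so the shared 2D array is probably still the simpler choice unless you are comfortable with malloc and free.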





Thanks a lot for your detailed answer. However, I still don't understand how to make temporary_variable a thread-local array (see the edit on my post). Maybe this is not something that can be done in Cython and I need to refactor my code to avoid needing thread-local arrays.
– nicoco
Sep 3 at 8:40








I think the edit should be what you want.
– DavidW
Sep 3 at 17:59





Thanks again. #3 is what I was already doing; the problem is that it requires a huge amount of RAM for large inputs. I guess #2 is what I need to do, but I need to improve my C skills first. Right now, I just gave up on parallelism for this specific case and well, it gives me an excuse to hang out on SO while waiting for my results. :o)
– nicoco
Sep 4 at 8:01





It isn't the same as what you show in the question. You create an array that's input_size x result_size. I create one that's arr_in.shape[1] x number_of_threads, so its size no longer grows with the number of loop iterations. number_of_threads is typically reasonably small (4 or 8?).
– DavidW
Sep 4 at 8:28





Oh sorry I missed that. I think that is exactly what I was looking for. I'll try it ASAP. Thank you very much.
– nicoco
Sep 4 at 11:09


