Make row/colscale, select_bitmap more memory-friendly for CUDA #406

VidithM · 2025-03-07T06:58:31Z

Defer memcpys from the input to output matrix in row/colscale, select_bitmap until we know whether we are using CUDA. If so, do the memcpy inside the kernel.

DrTimothyAldenDavis · 2025-03-07T13:23:21Z

Oops... my last comment (just deleted) was a reference to LAGraph, not GraphBLAS ... I got mixed up on which package I was looking at.

CUDA/template/GB_jit_kernel_cuda_colscale.cu

DrTimothyAldenDavis · 2025-03-07T13:32:03Z

Source/mxm/GB_colscale.c

+        int64_t *tmp_Ah = A->h ;
+        A->p = NULL ;
+        A->h = NULL ;
+        GB_OK (GB_dup_worker (&C, C_iso, A, false, ztype)) ;


I hadn't considered this option for using GB_dup_worker but I see from my code that it can work. However, it might be cleaner to revise GB_dup_worker, and pass in flags that disable the copy of specific parts of the matrix, like "bool Ap_dup" or something, for each component. I can pull in this PR for now, but a "// FIXME: revise this to use flags for each component" can be added here.

That makes sense. It looks like this wouldn't be a huge change, I can make a follow-up PR for this.

DrTimothyAldenDavis · 2025-03-07T13:35:42Z

Looks great overall, just some minor comments and tweaks above.

DrTimothyAldenDavis · 2025-03-07T13:39:45Z

GB_cuda_ek_slice needs some work. It assumes the input matrix is always the "A" matrix, and it has a specific type for the Ap array it's using: GB_Ap_TYPE. It's not GB_Aj_TYPE so that will break if Ap and Aj have different types. That's one reason why I'm still using plain int64_t for scalars, like the return value of GB_cuda_ek_slice_entry. We probably should use C++ templates here so I can call GB_cuda_ek_slice for other matrices, like "B".

DrTimothyAldenDavis · 2025-03-07T13:41:34Z

I see CUDA/template/GB_cuda_jit_AxB_dot3_dense_phase1.cuh is broken. I'm calling the ek_slice methods on the M matrix but internally I use the "A" matrix integers.

VidithM · 2025-03-07T19:01:45Z

This should be good now. I'll make a follow-up PR to add flags to GB_dup_worker specifying which arrays to dup.

CUDA/template/GB_jit_kernel_cuda_select_sparse.cu

CUDA/template/GB_jit_kernel_cuda_colscale.cu

CUDA/template/GB_jit_kernel_cuda_rowscale.cu

DrTimothyAldenDavis · 2025-03-07T22:35:57Z

Source/mxm/GB_colscale.c

+
+            if (A->p != NULL)
+            {
+                GB_memcpy (C->p, A->p, (anvec+1) * psize, nthreads_max) ;


We need to assert that C->p and A->p have the same integer type. Likewise C->h and A->h.

Fixed - this is already known to be true from the results of GB_dup_worker, and we use the same assumption elsewhere, e.g. in the kernels. Should the same assert be there too?

Source/mxm/GB_rowscale.c

VidithM · 2025-03-10T19:02:26Z

Addressed feedback and synced with the latest dev2 changes

Make row/colscale, select_bitmap more memory-friendly for CUDA

4a03d25