![]() |
EPFLSTILOBMarcel LeuteneggerAbout MATLAB toolbox C-MEX wrapper performance hints |
---|
MATLAB toolbox
Double class
Extended class
Single class
Flex99-12C correlator
Flex02-0xD correlator
Vector functions
MEX-Builder
C-MEX wrapper
FPU instructions
The following code shows a typical prologue in a C-MEX file implementing an elementary function:
#include "mex.h"
void mexFunction(int nlhs, mxArray* plhs[],
int nrhs, const mxArray* prhs[])
{
// ... argument checking ...
plhs[0]=mxCreateNumericArray(
mxGetNumberOfDimensions(prhs[0]),
mxGetDimensions(prhs[0]),
mxDOUBLE_CLASS,
mxIsComplex(prhs[0]));
// ... computation ...
}
The function first checks the type and dimensions of all passed input arguments. Next, it creates for each output argument a new mxArray
to store the result. Finally, the computation is executed.
By default, all mxArray
creation functions initialize the data elements to zero before returning the array. This guarantees a well-defined state of all data elements not overwritten during computation. Unfortunately, if the function overwrites all data elements, the previous initialization does nothing than waste processing resources.
At first glance, the data initialization consumes only little processing resources. This is effectively true for small output arrays with only a few data elements.
But for medium and especially for large sized arrays, the initialization voids the processor cache and blocks the system memory bus during cache flush. Hence, the data initialization tends to wipe previous results that are potential inputs to the function and delays the read-in of the input arguments.
Modern processors as a Pentium 4 have a theoretical peak memory-to-cache transfer rate of about 4GByte/s. This transfer rate is seldom achieved due to alternate reads and writes implying bus delays on each alternation. But a 2GHz processor has a floating-point multiplication throughput of about 1GFlops/s - hence an internal data throughput of about 24GByte/s (two double
reads & one write per multiplication). It is clear that memory transfers need to be minimized in order to achieve top performance.
I found that false initialization consumes up to 60% of the overall execution time for a sequence of functions called with intermediate results as input arguments. This is nearly as disastrous as to do calculations element-wise (in loops) instead of vectorizing the problem. In general, to achieve top performance, we should
This example shows the C-MEX wrapper file for the vector length function vabs
:
#include "mex.h"
#define L plhs[0]
#define V prhs[0]
void fvabs(double* oPr, const double* vPr,
const double* vPi, int n);
void mexFunction(int nlhs, mxArray* plhs[],
int nrhs, const mxArray* prhs[])
{ int n;
if (nlhs > 1) mexErrMsgTxt("Too many output arguments.");
switch(nrhs)
{default:
mexErrMsgTxt("Incorrect number of arguments.");
case 0:
mexPrintf("\nFast vector length.\n\n\t© 15.1.2004\n\n");
break;
case 1:
[1] n=mxGetNumberOfElements(V);
[2] L=mxCreateDoubleMatrix(0,0,mxREAL);
if (n)
{ double* pr;
const int* d=mxGetDimensions(V);
if (*d != 3) mexErrMsgTxt("Incompatible dimensions.");
n/=3;
[3] mxSetDimensions(L,d,mxGetNumberOfDimensions(V));
[4] mxSetPr(L,pr=mxMalloc(n*sizeof(double)));
mxSetM(L,1);
fvabs(pr,mxGetPr(V),mxGetPi(V),n);
}
}
}
mxArray
creation functions don't allocate memory for the data elements if the array is empty, i.e. the pointers on the real and imaginary parts are both set to null
. This is a convenient way to avoid extra, time-consuming allocations. The wrapper function terminates here if the input array is empty.mxMalloc
instead of mxCalloc
. The fvabs
routine will set the values of all data elements anyway.See also the external API reference in the MATLAB documentation.
© 2011 École Polytechnique Fédérale de Lausanne