Tuesday, January 7, 2020

Plotting Lots of Data in Matlab

This post describes some code that I've written to make plotting of large datasets, specifically line plots, much faster in Matlab. The code is available on GitHub and the Mathworks File Exchange.

Introduction - What does this code do?

Plotting large amounts of data can be painful in Matlab. I have not yet made an attempt to quantify "large," but this generally means arrays of tens to hundreds of millions of data points (or larger), with plotting delays ranging from a few seconds to many tens of seconds. Zooming in on subsets feels even more painful given the general expectation of a nearly instantaneous response when manipulating graphics objects, coupled with the reality of waiting many seconds for the plot to rerender.

For a while now, I've been aware of a workaround to speed up plotting: plot only a subset of the actual data, chosen so that the plot looks the same. When the user zooms, the plotting function needs to redraw so that the illusion holds. The subset that is drawn consists of local minima and maxima. The idea is that if you take a line plot and squish it horizontally, you can't see the intermediate vertical back-and-forths of the data, just the extremes. Put another way, imagine a plot of a sine wave. If the plot is narrowed, you can only see the maximum and minimum values, not the wave itself.
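The reduction itself can be sketched in C. This is only an illustration of the idea, not the plotBig implementation; the function name, the fixed chunking, and the assumption that the chunk size divides the data evenly are all mine:

```c
#include <stddef.h>
#include <assert.h>

/* Sketch of min/max subset sampling: for each horizontal "chunk" of the
   plot, keep only the minimum and maximum sample. Drawn at plot width,
   the decimated line looks the same as the full data set.
   Assumes n_samples is evenly divisible by n_chunks. */
static void minmax_decimate(const double *data, size_t n_samples,
                            size_t n_chunks, double *mins, double *maxs)
{
    size_t samples_per_chunk = n_samples / n_chunks;
    for (size_t i = 0; i < n_chunks; i++) {
        const double *chunk = data + i * samples_per_chunk;
        double lo = chunk[0];
        double hi = chunk[0];
        for (size_t j = 1; j < samples_per_chunk; j++) {
            if (chunk[j] < lo) lo = chunk[j];
            if (chunk[j] > hi) hi = chunk[j];
        }
        mins[i] = lo;
        maxs[i] = hi;
    }
}
```

With 2 output pairs per on-screen pixel column, a 100-million-point array collapses to a few thousand plotted points.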

The sum of two sine waves is plotted in the figure below. The left plot consists of 100 million data points. The right plot shows only local minima and maxima, and renders in a fraction of the time.

Left: Normal. Right: plotBig code

Zooming in (without redrawing) shows the difference between the two plotting approaches. In the left plot, I've zoomed in on the normal plot, and the faster sine wave can be seen. Again, this process is relatively slow. In the middle plot, I've zoomed in on the plot which only showed a subset of the data. Without updating the plotted data, the min/max sampling can be seen. This plot may appear strange and actually highlights a couple of optimizations that I made (discussed in the optimizations section below). Finally, the third plot shows the subset-plotting code with redrawing enabled. Again, it looks the same as the leftmost plot.

Left: Zoom in on normal plot. Middle: Zoom in on plotBig without a redraw. Right: Zoom in on plotBig with a redraw.

Doesn't this code exist already?

When I started writing this code, I looked at the other options that were available. The best option was the code by Tucker McClure. At the time, multiple plot objects on the same axes were not allowed, and there was a memory leak that could cause problems. Both of these issues appear to be fixed at the time of this writing. There does, however, remain a bug in the subset sampling, as was pointed out by Robbert on the File Exchange in December 2014.

The biggest impetus for updating Tucker's code was that the size of the data being plotted could easily start to cause memory problems. Most of my plotting involves evenly sampled time data that can be completely described by a start time and a sampling rate. Using an object which contains these parameters halves my memory requirements. Additionally, it makes it very quick to translate from a time to a given index.

Since Tucker's code did not support my time objects, I started updating his code to support them. The main changes needed to be in the function that translates the full data set into the subset of data that should be used for plotting (the reduce_to_width() function). As I started looking at that function, however, I couldn't help but notice how much of the code could be improved. I thus slowly began rewriting aspects of his code.

Optimizations - The fun part

Initial Optimizations

My initial optimizations focused on keeping the essence of Tucker's code while rewriting particular bottlenecks. These optimizations included: 1) the use of the time object, which removed the need to find subset boundaries; 2) mex code to find the minima and maxima within a subset (under certain conditions), with many thanks to Jan Simon for writing that code; and 3) the use of Matlab's min() and max() functions, which parallelize quite nicely over multi-dimensional data.

The use of a time object is a very important improvement relative to Tucker's code. His code uses a binary search to find boundaries for a particular subset. This is not very efficient when the data are evenly sampled. Finding boundaries for each subset also turns out not to be necessary (although this was a later optimization).
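As a sketch of why even sampling helps: converting a time to an index becomes a single arithmetic expression rather than a search. Here t0 and fs stand in for the time object's start time and sampling rate, and the round-to-nearest choice is my own illustration:

```c
#include <stddef.h>
#include <assert.h>

/* Closed-form time-to-index conversion for evenly sampled data.
   Sample i occurs at t0 + i/fs, so the nearest index to time t is
   round((t - t0)*fs). No binary search over timestamps is needed. */
static size_t time_to_index(double t, double t0, double fs)
{
    return (size_t)((t - t0) * fs + 0.5); /* round to nearest sample */
}
```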

The use of Matlab's min() and max() also warrants an explanation. Using min() and max() in code that calculates the min and max of a subset is relatively obvious; the parallelization they can enable is not. Like many Matlab functions, min() and max() are multithreaded. Consider the following code, where we need to get the min and max within chunks that are 10 samples each:


min_data = zeros(1,n_chunks);
max_data = zeros(1,n_chunks);
for iChunk = 1:n_chunks
   start_sample = starts(iChunk);
   end_sample   = ends(iChunk);
   min_data(iChunk) = min(data(start_sample:end_sample));
   max_data(iChunk) = max(data(start_sample:end_sample));
end

However, if the data are sized right, we can reshape the data and then compute the min and the max over all chunks with just a couple of calls. This has the added advantage that Matlab will use multiple threads to compute the min and max of each chunk.


data2 = reshape(data,[10 length(data)/10]);
min_data = min(data2,[],1);
max_data = max(data2,[],1);

The reshape() call, however, requires that the length of the data be evenly divisible by 10. In reality, the chunk size is usually much larger, making it even less likely that the length of the data is evenly divisible by the chunk size.

This is where it helps to understand a bit about how an array is stored in Matlab. An array in Matlab consists of a header, which tracks things like its size, whether or not it contains complex data, the type of data it contains (double, single, uint32, etc.), and a pointer to the data. The exact contents of this header are not documented (see this link for more info), but some of these properties can be edited directly using mex functionality. Using mex, it is possible to "truncate" the array, which simply updates a size parameter in the header without doing anything to the underlying data. The data itself is just a series of bytes; the header supplies the information needed to interpret its shape.
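A toy model of this header/data split (illustrative only; the real mxArray layout is undocumented and this struct is not it):

```c
#include <stddef.h>
#include <assert.h>

/* Toy model of a Matlab array: the shape lives in a header, separate
   from the payload. "Resizing" rewrites the header's size fields and
   leaves the data pointer (and bytes) untouched, analogous to the
   mxSetM/mxSetN calls used in the mex code. */
typedef struct {
    size_t m;      /* rows */
    size_t n;      /* columns */
    double *data;  /* column-major payload, not owned here */
} toy_array;

static void toy_resize(toy_array *a, size_t new_m, size_t new_n)
{
    a->m = new_m;  /* header edit only */
    a->n = new_n;  /* a->data is untouched */
}
```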

Thus, in mex, the array is truncated so that its length is divisible by the desired chunk size. Once this is done, the array can be "reshaped" (again, just updating parameters in the header), and Matlab's min() and max() functions can happily be called (from mex) on the "truncated and reshaped" data. Afterwards, the size of the array is restored (still in mex), so that no problems arise when working with the data back in Matlab. This approach, however, does not compute the min and the max of the remaining data that was insufficiently long to form a full chunk, so a final post-processing step computes the min and max of that last partial chunk. The mex code is shown below for reference.


#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize new_m, new_n, old_m, old_n;
    mxArray *data;
    mxArray *lhs_c1[2];
    mxArray *lhs_c2[2];
    mxArray *in_min_max[3];
    mxArray *empty_input, *dim_input;
    
    //Cast away const so the header can be edited in place
    data  = (mxArray *)prhs[0];
    new_m = (mwSize)mxGetScalar(prhs[1]);
    new_n = (mwSize)mxGetScalar(prhs[2]);
        
    //Save the original size, then "truncate" via the header
    old_m = mxGetM(data);
    old_n = mxGetN(data);
    mxSetM(data,new_m);
    mxSetN(data,new_n);
    
    //Inputs correspond to min(data,[],1), i.e. operate down the columns
    in_min_max[0] = data;
    empty_input = mxCreateNumericMatrix(0,0,mxDOUBLE_CLASS,mxREAL);
    dim_input = mxCreateDoubleScalar(1);
    in_min_max[1] = empty_input; //Empty
    in_min_max[2] = dim_input; //1
    
    //[mins,min_indices] = min(data,[],1)
    mexCallMATLAB(2, lhs_c1, 3, in_min_max, "min");
    plhs[0] = lhs_c1[0];
    plhs[1] = lhs_c1[1];
    
    //[maxes,max_indices] = max(data,[],1)
    mexCallMATLAB(2, lhs_c2, 3, in_min_max, "max");
    plhs[2] = lhs_c2[0];
    plhs[3] = lhs_c2[1];
    
    //Reset the size ...
    mxSetM(data,old_m);
    mxSetN(data,old_n);
    
    mxDestroyArray(empty_input);
    mxDestroyArray(dim_input);
}
This approach worked quite nicely, but it only applies to either 1) a single channel in its entirety or 2) multiple channels whose lengths are evenly divisible by the specified chunk size, which again is unlikely. The approach should also work for a subset of a single channel, but that would require updating where the "start" of the data is located, which Matlab detects and disallows (more specifically, the call to mxSetData fails).

In the cases where the resize trick didn't work, I relied on the mex code that Jan Simon wrote in response to my asking the online community for a quick min/max-of-subsets solution (my question, FEX link).

Optimizations - Take 2

The story would have ended there if there hadn't been a bug in my code. Actually, there were a couple of bugs which were eventually removed, but one in particular was really hard to diagnose, largely because the code was way too complicated and my errors came from manual interaction with plots. In the end, the big issue was that callbacks are not always called when the x-limits of an axes change.

The main optimization was made possible by additional work I had done writing a quite fast JSON parser for Matlab, which I intend to post about soon. In the process, I learned how to use OpenMP as a relatively simple way to parallelize C code, and I became much more comfortable writing C. Thus I decided to write code that computes min and max values over subsets of data, in parallel, regardless of the size of the input data.

Perhaps the most interesting aspect of the OpenMP code is the use of the collapse directive. For multiple channels, it is desirable to process the channels in parallel. However, it is also desirable to parallelize the reduction of each subset to its respective minimum and maximum. Parallelizing multiple loop depths can be accomplished with the "collapse" directive. This works as long as the code is written such that the innermost body of the collapsed loops can execute in any order. In practice, this means any indexing that depends on the loop variables must be computed directly from those variables, rather than by accumulating an offset from the previous iteration. This is illustrated in the code below.


#pragma omp parallel for simd collapse(2)
for (mwSize iChan = 0; iChan < n_chans; iChan++){
    for (mwSize iChunk = 0; iChunk < n_chunks; iChunk++){
        //Offsets are computed directly from the loop variables,
        //so the iterations can run in any order
        double *current_data_point = p_start_data + n_samples_data*iChan + ...
        double *local_output_data  = p_output_data + n_outputs*iChan + 2*iChunk;
        // ... third loop, which loops over the samples
        // in the chunk and assigns the min and max
        // to the output
    }
}
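A self-contained version of this pattern is sketched below. The sizes are toy values and the data layout (contiguous channels, fixed-length chunks) is my simplification; the pragma is simply ignored if the compiler isn't invoked with OpenMP support, leaving a correct serial loop:

```c
#include <stddef.h>
#include <assert.h>

/* Toy sizes for illustration only */
enum { N_CHANS = 2, N_CHUNKS = 3, CHUNK_LEN = 4 };

/* collapse(2) pattern: every offset is derived from (iChan, iChunk)
   alone, so the collapsed iterations may execute in any order.
   out holds [min max] pairs: 2*N_CHUNKS values per channel. */
static void chunk_minmax(const double *data, double *out)
{
    #pragma omp parallel for collapse(2)
    for (int iChan = 0; iChan < N_CHANS; iChan++) {
        for (int iChunk = 0; iChunk < N_CHUNKS; iChunk++) {
            const double *p = data + (size_t)N_CHUNKS*CHUNK_LEN*iChan
                                   + (size_t)CHUNK_LEN*iChunk;
            double lo = p[0], hi = p[0];
            for (int j = 1; j < CHUNK_LEN; j++) {
                if (p[j] < lo) lo = p[j];
                if (p[j] > hi) hi = p[j];
            }
            out[2*N_CHUNKS*iChan + 2*iChunk]     = lo;
            out[2*N_CHUNKS*iChan + 2*iChunk + 1] = hi;
        }
    }
}
```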
 

I also realized that, as long as the data were sufficiently sampled, 1) it was not necessary to keep track of which came first, the min or the max, and 2) the times of these events could be well approximated by linearly spacing them in time from the start of the data to the end, rather than tracking the exact time of each min and max event. This explains the strange shape seen in the second figure above when zooming in without redrawing (the min and max are out of order, and they are not plotted at their actual locations).
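The linear-time approximation amounts to the following (a hypothetical helper, not the actual plotBig code):

```c
#include <stddef.h>
#include <assert.h>

/* Assign the n_out plotted points evenly spaced x-values spanning
   [t0, t_end], rather than the true times of each min/max event.
   At plot width the discrepancy is invisible. Requires n_out >= 2. */
static void approx_times(double t0, double t_end, size_t n_out, double *x)
{
    double dt = (t_end - t0) / (double)(n_out - 1);
    for (size_t i = 0; i < n_out; i++)
        x[i] = t0 + dt * (double)i;
}
```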

The Callback Debacle

In the old version I also had a pretty terrible system for delaying redraw callbacks. The reason for wanting to delay callbacks is that Matlab will occasionally toggle the axis limits, and it doesn't make sense to redraw the plot every time these changes occur in quick succession. The general idea was to have a callback start a timer that would redraw the plot at a later time. However, if another callback occurred, the timer was reset and the redraw would have to wait a bit longer. Thus, the redraw would only occur 1) after a callback indicating that the graph needed to be updated and 2) after a sufficient "quiet time" in which no callbacks had occurred. This ran into problems when panning, since the constant inflow of callbacks meant that a redraw would never occur. So in addition, I needed to keep track of how long it had been since the first callback, so that if too much time had elapsed I would redraw anyway. If this sounds confusing, it was. I have since taken a different and much simpler approach.
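The old rule can be captured in a small predicate (a sketch with made-up constants and names, not the actual implementation):

```c
#include <assert.h>

/* Debounce-with-deadline: redraw only after QUIET seconds with no
   callbacks, but never wait more than MAX_WAIT seconds after the first
   callback, so continuous panning still triggers redraws. Inputs are
   timestamps in seconds; the constants are illustrative. */
static int should_redraw(double now, double first_cb, double last_cb)
{
    const double QUIET = 0.2, MAX_WAIT = 1.0;
    if (now - last_cb >= QUIET)    return 1; /* limits stopped changing */
    if (now - first_cb >= MAX_WAIT) return 1; /* panning: force a redraw */
    return 0;
}
```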

Now I attach a timer to the data which periodically checks whether the axis limits have changed. If they have, the plot is redrawn. It is still possible to hit a redraw in the middle of a plot change, but on average this approach works very well.

Conclusion

What started out as a few simple optimizations turned into a solid few weeks of work spread out over a couple of years. Overall, I'm pretty satisfied with how the final product turned out. I told a fellow post-doc about this project two years ago, and he expressed interest and asked me to send him the code when I finished. Well, I can finally check that off my list!

The approach that I settled on works pretty well, although there are a few things that could potentially be improved. I tried using explicit SIMD instructions, but the code was much more complex and ran at about the same speed. I think this is most likely because the min and max SIMD instructions only operate on 4 doubles at a time. Given the overhead of the SIMD calls, and that I need to issue both min and max calls rather than potentially only one (i.e., if a value is the minimum, it won't be the maximum), the whole thing is a wash. If the instructions expand from 256 bits (4 doubles) to 512 bits (8 doubles), that story might change. I've also considered using a GPU to improve the code, but this is intimidating as I have never done GPU programming. Perhaps more importantly, I am unclear how well this generalizes when needing to support a wide range of graphics cards with different levels of support for GPU functions.
