Tuesday, January 7, 2020

Plotting Lots of Data in Matlab

This post will describe some code that I've written to make plotting of large datasets, specifically line plots, much faster in Matlab. The code is available on GitHub and the Mathworks File Exchange.

Introduction - What does this code do?

Plotting large amounts of data can be painful in Matlab. I have not yet made an attempt to quantify "large", but this generally means array sizes in the tens to hundreds of millions of data points (or larger), with plotting delays in the range of a few seconds to many tens of seconds. Zooming in on subsets feels even more painful given the general expectation of a nearly instantaneous response when manipulating graphics objects, coupled with the reality of waiting many seconds for the plot to rerender.

For a while now, I've been aware of a workaround to speed up plotting, which is to plot only a subset of the actual data, chosen so that the plot looks the same. When the user zooms, the plotting function needs to redraw so that the illusion holds for the new view. The subset that is drawn consists of local minima and maxima. The idea is that if you take a line plot and squish it horizontally, you can't see the intermediate vertical back-and-forths of the data, just the extremes. Put another way, imagine a plot of a sine wave. If the plot is narrowed enough, you can only see the maximum and minimum values, not the wave itself.
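As a rough illustration of the idea (this is just a sketch, not the actual plotBig implementation), the following code keeps only the min and max of each chunk of samples before plotting, with the chunk size chosen so that roughly two points survive per horizontal pixel:

%Sketch of min/max downsampling for plotting (not the plotBig code itself)
x = linspace(0,1,1e7);
y = sin(2*pi*5*x) + 0.1*sin(2*pi*5000*x);

n_out_chunks = 2000;                      %roughly the pixel width of the axes
chunk_size   = floor(length(y)/n_out_chunks);
n_keep       = n_out_chunks*chunk_size;   %drop the remainder for simplicity

y2    = reshape(y(1:n_keep),chunk_size,[]);
y_min = min(y2,[],1);
y_max = max(y2,[],1);

%Interleave the mins and maxes so the plotted line still spans the full range
y_plot = reshape([y_min; y_max],1,[]);
x_plot = linspace(x(1),x(n_keep),length(y_plot));

plot(x_plot,y_plot)   %looks like plot(x,y), but with ~2500x fewer points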

The sum of two sine waves is plotted in the figure below. The left plot consists of 100 million data points. The right plot shows only the local minima and maxima, and plots in a fraction of the time.

Left: Normal. Right: plotBig code

Zooming in (without redrawing) shows the difference between the two plotting approaches. In the left plot, I've zoomed in on the normal plot, and the faster sine wave can be seen. Again, this process is relatively slow. In the middle plot, I've zoomed in on the plot which only showed a subset of the data. Without updating the plotted data, the min/max sampling can be seen. This plot may appear strange and actually highlights a couple of optimizations that I made (to be discussed in the later optimizations section). Finally, the third plot shows the subset plotting code with redrawing enabled. Again, it looks the same as the leftmost plot.

Left: Zoom in on normal plot. Middle: Zoom in on plotBig without a redraw. Right: Zoom in on plotBig with a redraw.

Doesn't this code exist already?

When I started writing this code, I looked at the other options that were available. The best option was the code by Tucker McClure. At the time, multiple plot objects on the same axes were not allowed. There was also a memory leak that could cause problems. Both of these problems appear to be fixed at the time of this writing. There does, however, remain a bug in the subset sampling, as was pointed out by Robbert on the File Exchange in December 2014.

The biggest impetus for updating Tucker's code was that the size of the data being plotted could easily start to cause memory problems. Most of my plotting involves evenly sampled time data that can be completely described by a start time and a sampling rate. Using an object which contains these parameters halves my memory requirements. Additionally, it makes it very quick to translate from a time to a given index.
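As a sketch of the idea (the actual class in the repository has more functionality than this), an evenly sampled signal never needs an explicit time vector; a start time and a sampling rate are enough to recover any time or index on demand:

%Sketch of an "evenly sampled time" representation
%(the real time object in the plotBig code is more elaborate)
t0 = 0;        %start time, in seconds
fs = 20000;    %sampling rate, in Hz
n  = 1e8;      %number of samples

%Instead of storing t = t0 + (0:n-1)/fs (another 800 MB of doubles),
%compute times and indices on demand:
time_at_index = @(idx) t0 + (idx - 1)/fs;
index_at_time = @(t) round((t - t0)*fs) + 1;

index_at_time(2.5)     %index of the sample at t = 2.5 s
time_at_index(50001)   %time of the 50001st sample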

Since Tucker's code did not support my time objects, I started updating his code to support these objects. The main changes needed to be in the main function that translates the full data set into the subset of data that should be used for plotting (the reduce_to_width() function). As I started looking at that function, however, I couldn't help but notice how much of the code could be improved. I thus slowly began rewriting aspects of his code.

Optimizations - The fun part

Initial Optimizations

My initial optimizations focused on trying to keep the essence of Tucker's code, but rewriting particular bottlenecks. These optimizations included: 1) the use of the time object, which removed the need to find subset boundaries; 2) mex code to find the minima and maxima within a subset in certain conditions (many thanks to Jan Simon for writing that code); and 3) the use of Matlab's min() and max() functions, which parallelize quite nicely over multi-dimensional data.

The use of a time object is a very important improvement relative to Tucker's code. His code uses a binary search to find boundaries for a particular subset. This is not very efficient when the data are evenly sampled. Finding boundaries for each subset also turns out not to be necessary (although this was a later optimization).

The use of Matlab's min() and max() also warrants an explanation. Using min() and max() in code which calculates the min and max of a subset is relatively obvious; however, the parallelization that these calls can enable is not. Like many Matlab functions, min() and max() are both multithreaded. Consider the following code, where we need to get the min and max within chunks that are 10 samples each:


min_data = zeros(1,n_chunks);
max_data = zeros(1,n_chunks);
for iChunk = 1:n_chunks
   start_sample = starts(iChunk);
   end_sample   = ends(iChunk);
   min_data(iChunk) = min(data(start_sample:end_sample));
   max_data(iChunk) = max(data(start_sample:end_sample));
end

However, if the data are sized right, we can reshape the data and then compute the max and the min over all subsets with just a couple of calls. This has the added advantage that Matlab will use multiple threads to compute the min and max of each chunk.


data2 = reshape(data,10,[]);
min_data = min(data2,[],1);
max_data = max(data2,[],1);

The reshape() call, however, requires that the length of the data be evenly divisible by 10. In reality the size of each chunk is usually much larger, making it even less likely that the length of the data is evenly divisible by the chunk size.
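In pure Matlab, the usual workaround is to reshape only the evenly divisible portion and handle the leftover samples separately. A sketch (not the library code) is shown below; note that indexing data(1:n_even) makes a copy of the data, which is exactly the cost that the mex trick described next avoids:

%Sketch: chunked min/max when length(data) is not a multiple of chunk_size
chunk_size = 5000;
n_chunks   = floor(length(data)/chunk_size);
n_even     = n_chunks*chunk_size;

data2    = reshape(data(1:n_even),chunk_size,[]);   %this copies the data
min_data = min(data2,[],1);
max_data = max(data2,[],1);

%Handle the partial chunk at the end, if any
if n_even < length(data)
    min_data(end+1) = min(data(n_even+1:end));
    max_data(end+1) = max(data(n_even+1:end));
end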

This is where it helps to understand just a bit about how an array is stored in Matlab. An array in Matlab consists of a header, which tracks things like its size, whether or not it contains complex data, the type of data it contains (double, single, uint32, etc.), and a pointer to the data. The exact contents of this header are not documented (see this link for more info), however some of these properties can be edited directly using mex functionality. Using mex, it is possible to "truncate" the size of the array, which essentially updates a parameter in the array header, without actually doing anything to the underlying data. The actual data is just a contiguous series of bytes; it is the header that tells Matlab how to interpret the shape of the array.

Thus, in mex, the array is first truncated so that its length is divisible by the desired chunk size. Once this is done, the array can be "reshaped" (again, just updating parameters in the header), and Matlab's min() and max() functions can happily be called (from mex) on the truncated and reshaped data. Finally, the size of the array is restored (still in mex), so that no problems are caused when working with the data back in Matlab. This approach does not compute the min and the max of the remaining data that was insufficiently long to form a full chunk, so a final post-processing step is necessary to handle that last partial chunk. The mex code is shown below for reference.


#include "mex.h"

//Computes the min and max of each column (plus their indices) after
//temporarily resizing the input array to [new_m x new_n]
//Usage (the name is whatever the compiled mex file is called):
//  [mins,min_I,maxes,max_I] = this_mex(data,new_m,new_n)
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize new_m, new_n, old_m, old_n;
    mxArray *data;
    mxArray *lhs_c1[2];
    mxArray *lhs_c2[2];
    mxArray *in_min_max[3];
    mxArray *empty_input, *dim_input;
    
    //Note: we deliberately modify the input array's header in place
    data  = (mxArray *)prhs[0];
    new_m = (mwSize)mxGetScalar(prhs[1]);
    new_n = (mwSize)mxGetScalar(prhs[2]);
        
    old_m = mxGetM(data);
    old_n = mxGetN(data);
    mxSetM(data,new_m);
    mxSetN(data,new_n);
    
    //Build the inputs for min(data,[],1) and max(data,[],1)
    in_min_max[0] = data;
    empty_input = mxCreateNumericMatrix(0,0,mxDOUBLE_CLASS,mxREAL);
    dim_input = mxCreateDoubleScalar(1);
    in_min_max[1] = empty_input; //Empty
    in_min_max[2] = dim_input;   //1
    
    mexCallMATLAB(2, lhs_c1, 3, in_min_max, "min");
    plhs[0] = lhs_c1[0];
    plhs[1] = lhs_c1[1];
    
    mexCallMATLAB(2, lhs_c2, 3, in_min_max, "max");
    plhs[2] = lhs_c2[0];
    plhs[3] = lhs_c2[1];
    
    //Reset the size ...
    mxSetM(data,old_m);
    mxSetN(data,old_n);
    
    mxDestroyArray(empty_input);
    mxDestroyArray(dim_input);
}

This approach worked quite nicely, but it only applies to either 1) a single channel in its entirety, or 2) multiple channels whose lengths are evenly divisible by the specified chunk length, which again is unlikely. The approach should also work for a subset of a single channel; however, that would require updating where the "start" of the data is located, which Matlab detects and disallows (more specifically, the call to mxSetData fails).

In the cases where the resize trick didn't work, I relied on the mex code that Jan Simon wrote in response to my asking the online community for a quick min/max-of-subsets solution (my question, FEX link).

Optimizations - Take 2

The story would have ended there if there hadn't been a bug in my code. Actually, there were a couple of bugs which were eventually removed, but one in particular was really hard to diagnose, largely due to the code being way too complicated and the errors coming from manual interaction with plots. In the end, the big issue had to do with callbacks not always being called when the x-limits of an axes were changed.

The main optimization was made possible from additional work that I had done on writing a quite fast JSON parser for Matlab, which I intend to post about soon. In the process, I learned how to use OpenMP as a relatively simple approach to parallelize C code. I also became much more comfortable with writing C code. Thus I decided to write code that would compute min and max values over subsets of data, in parallel, regardless of the size of the input data.

Perhaps the most interesting aspect of the OpenMP code is the use of the collapse directive. For multiple channels, it is desirable to make parallel calls for each channel. However, it is also desirable to parallelize the reduction of each subset to its respective minima and maxima. Parallelizing multiple loop depths can be accomplished with the "collapse" directive. This approach works as long as the code is written such that you can jump into the innermost part of the collapsed loop in any order. What this means is that any indexing that depends on the loop variables needs to be computed directly from those variables, rather than by accumulating an offset from the previous iteration. This is illustrated in the code below.


#pragma omp parallel for simd collapse(2)
for (mwSize iChan = 0; iChan < n_chans; iChan++){
    for (mwSize iChunk = 0; iChunk < n_chunks; iChunk++){

        //Pointers are computed directly from the loop variables,
        //not by carrying an offset over from a previous iteration
        double *current_data_point = p_start_data + n_samples_data*iChan + ...
        double *local_output_data  = p_output_data + n_outputs*iChan + 2*iChunk;

        // ... Third loop which loops over samples in the subset
        //     and assigns the min and max to the output
    }
}

I also realized that, as long as the data were sufficiently sampled, 1) it was not necessary to keep track of which came first, the min or the max, and 2) the times of these events could be well approximated by spacing the points linearly in time from the start of the data to the end, rather than keeping track of the exact times of the min and max events. This explains the strange shape seen in the 2nd figure above when zooming in on the data without redrawing (namely, that the min and max can be out of order, and that they are not plotted at their actual locations).
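In code, the approximation amounts to something like the following (variable names here are illustrative, not the library's):

%The interleaved min/max values (2 per chunk) are simply spread evenly
%across the plotted time span, rather than placed at their true times
n_points = length(y_min_max);   %interleaved min/max values
x_approx = linspace(t_span_start,t_span_end,n_points);
plot(x_approx,y_min_max)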

The Callback Debacle

In the old version I also had a pretty terrible system for trying to delay redraw callbacks. The reason for wanting to delay callbacks is that Matlab will occasionally toggle the axis limits, and it doesn't make sense to redraw the plot every time if these changes occur in quick succession. The general idea was to have a callback that initiated a timer that would redraw the plot at a later time. However, if another callback occurred, the timer was reset and the redraw would have to wait just a bit longer. Thus, the redraw would only occur 1) after a callback indicating that the graph needed to be updated, and 2) after a sufficient "quiet time" in which no callbacks had occurred. However, this ran into problems when panning, since the constant inflow of callbacks meant that a redraw would never occur. So in addition I needed to keep track of how long it had been since the first callback, so that if too much time had elapsed, I would redraw anyway. If this sounds confusing, it was. I have since taken a different and much simpler approach.

Now I attach a timer to the data which periodically checks to see if the axis limits have changed. If the limits have changed, the plot is redrawn. It is still possible to hit redraws in the middle of a plot change, but on average this approach works very well.
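A minimal sketch of this pattern is shown below (the names are illustrative, not the plotBig internals); a timer simply polls the x-limits a few times per second and triggers a redraw only when they have changed:

%Sketch: poll the axes limits with a timer instead of reacting to callbacks
h_axes = gca;
setappdata(h_axes,'last_xlim',get(h_axes,'XLim'));

t = timer('ExecutionMode','fixedSpacing','Period',0.2,...
    'TimerFcn',@(~,~)check_limits(h_axes));
start(t)

function check_limits(h_axes)
    new_xlim = get(h_axes,'XLim');
    if ~isequal(new_xlim,getappdata(h_axes,'last_xlim'))
        setappdata(h_axes,'last_xlim',new_xlim);
        %redraw_subset(h_axes,new_xlim)  %recompute the min/max subset here
    end
end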

Conclusion

What started out as a few simple optimizations turned into a solid few weeks of work spread out over the course of a couple of years. Overall, I'm pretty satisfied with how the final product turned out. I told a fellow post-doc about this project 2 years ago, and he expressed interest and asked me to send him the code when I finished. Well, I can finally check that off of my list!

The approach that I settled on works pretty well, although there are a few things that could potentially be improved. I tried to use SIMD; however, the code was much more complex and ran at about the same speed. I think this is most likely because the min and max SIMD instructions I used only work on 4 doubles at a time. Given the overhead of the SIMD calls, and the fact that I need to make both the min and the max calls without the potential shortcut of only checking one (i.e. if a value is the minimum, then it won't be the maximum), the whole thing is a wash. If the instructions expand from 256 bits (4 doubles) to 512 bits (8 doubles), that story might change. I've also considered trying to use a GPU to improve the code, but this is intimidating, as I have never done GPU programming. Perhaps more importantly, I am unclear how well this would generalize when needing to support a wide range of graphics cards with different levels of support for GPU functions.

Monday, July 1, 2019

Writing a performant JSON parser in Matlab to facilitate scientific data exchange

This post documents my experience writing a JSON parser in Matlab. The goal was to write a parser that made reading files of moderate size not painful. Along the way I got my first exposure to within-core and between-core parallelization (using SIMD and OpenMP, respectively). In particular, this post was written to document some of the approaches I used to try and make parsing JSON relatively fast (using C).

Code for this post is located at:
https://github.com/JimHokanson/turtle_json

Introduction

I am part of the OpenWorm organization, where our goal is to model a worm, C. elegans, as accurately as possible on a computer. Along the way we hope to learn general approaches for creating highly accurate models of even more complex organisms.

My role in the project has consisted of setting up a test framework for the analysis of worm movement. The goal is to be able to compare our modeled worms to real worms to guide model development.

At some point, some real worm scientists started discussing the creation of a data format to facilitate sharing of data between worm labs. My interest in working on this project was pretty minimal, as I was (and still am) more interested in comparing the modeled worm's behavior to that of real worms. After some prodding, I was finally convinced to help out, mainly because I was someone who was proficient in Matlab, a language that a sizable portion of worm scientists use.

The decision was made to specify the data format as deriving from JSON, a really simple text format that takes just a handful of pictures and words to describe. This is an example JSON document.

{
         "age": 34,
        "name": "Jim",
    "computer": {
                  "name": "Turtle",
                    "os": "Windows 7"
                },
     "my_bool": [true, false]
}

Exploring My Options

Before getting started on writing worm specific code I needed to find a JSON parser for Matlab. Looking at the file exchange I found many options. Here's a brief (and most likely not comprehensive) list (compiled in late 2016?).

1) JSONLab (Qianqian Fang) - written using Matlab code, probably the most popular option available
2) matlab-json (Christian Panton) - mex wrapper for json-c
3) Core_jsonparser (Kyle) - Makes calls to Python using some relatively new Matlab functionality (2015?)
4) JSON encode/decode (Léa Strobino) - mex wrapper for JSMN tokenizer
5) JSON Parser (Joel Feenstra) - Matlab code
6) Highly portable JSON-input parser (Nedialko) - Matlab code
7) (another) JSON Parser (François Glineur) - Matlab code
8) Parse JSON text (Joe Hicklin) - Matlab code
9) v8 native JSON parser (Eydrian) - wrapper for a V8 c++ parser
10) matlab-json (Kota Yamaguchi) - wrapper for a Java library

In the end, I decided to write my own JSON parser. My test case was a 75 MB file, relatively small by my standards. Using Matlab's native loading, i.e. loading the contents after they had been saved as a .mat file, the data took about 0.55 seconds to load (on my desktop, using the '-v6' mat-file version). JSONLab, arguably the most popular library for reading JSON files in Matlab, took 30 seconds, or 18 seconds if you were using 2015b or newer due to efforts to speed up Matlab's execution engine. Faster options, such as Christian Panton's C-based parser, lacked flexibility and returned everything as cell arrays that needed to be concatenated in Matlab, which wastes a significant amount of time.

Thus, at the time, the available JSON parsers lacked two things, speed and flexibility. To better understand why flexibility might be needed for something like a file parser, it is first useful to understand the difficulty of mapping JSON to Matlab data types.

JSON format specificity

Consider the following JSON, which you might parse as a 2D array (matrix):

[
    [1, 2, 3],
    [4, 5, 6]
]

Now consider that the next time you read this data you see:

[
    [1, 2, 3],
    [0],
    [4, 8, 9]
]

In this case the data cannot be combined into a matrix, since one of the sub-arrays has only 1 element, whereas the others have 3 elements. Instead this data must be returned as a cell array of arrays. Some JSON parsers might return a matrix in one case, and a cell array in others. This means the user has to do additional checks on what is returned from the parser. Other parsers might be more conservative and always return cell arrays, even if you always expect a matrix. This approach can be slow, as cell arrays are not memory efficient (compared to a matrix), and as indicated above, concatenating them post-hoc rather than at creation wastes time.
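In Matlab terms, the two snippets above naturally map to different types, which is exactly the problem:

%The first JSON array of arrays maps cleanly to a matrix:
data1 = [1 2 3; 4 5 6];

%The second cannot, because the inner arrays have different lengths,
%so it has to come back as a cell array of arrays:
data2 = {[1 2 3], 0, [4 8 9]};

%Code written against data1 (e.g. data1(2,3)) breaks if the parser
%silently returns something shaped like data2 instead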

Ideally the user would be able to acknowledge that this variability is possible, and ask that a cell array always be returned. Alternatively, they may expect a matrix, and thus ask for a matrix to be returned, with an error thrown when the data are not in matrix form.

Additionally, it is unclear whether the data should be interpreted as row-major or column-major. JSON doesn't specify, since the data are simply thought of as an array that holds arrays (similar to a cell array of arrays in Matlab). I don't think I saw a single Matlab JSON parser that allowed specifying whether to interpret such arrays using row-major or column-major ordering.
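For the first 2D example above, the ambiguity amounts to choosing between these two Matlab results:

%Row-major interpretation: each inner JSON array becomes a row
row_major    = [1 2 3; 4 5 6];    %2x3

%Column-major interpretation: each inner JSON array becomes a column
column_major = [1 4; 2 5; 3 6];   %3x2, i.e. row_major.'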

A similar problem exists with objects that may or may not have different properties. For example:

[{
    "a": 1,
    "b": 2
}, {
    "a": 3,
    "c": 5
}]

In the above example, the two objects have different properties ("b" vs "c"), and thus, at least in Matlab, they can't share a single concise definition. The consequence, as shown below, is that the user may be forced to deal with an array of objects (a structure array) sometimes and a cell array of structures at other times. Again, ideally the user could specify how they would like the data returned, to avoid variability in the output that they are working with.
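The Matlab consequence is easy to demonstrate:

s1 = struct('a',1,'b',2);
s2 = struct('a',3,'c',5);

try
    s_array = [s1, s2];   %fails - the field names don't match
catch ME
    disp(ME.message)
end

s_cell = {s1, s2};        %always works, but is clumsier to work with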

Moving Forward by Tokenizing

After looking through the parsers, I decided none had the desired combination of speed and flexibility. The first solution that came to mind was to find a fast JSON tokenizer that I could wrap with Matlab code. The specifics of what a tokenizer does can vary, but it may, for example, identify the starts and stops of all strings (and objects, arrays, etc.) in the JSON file. The idea is that the tokenizer does the heavy lifting, and that some additional code is wrapped around the tokenizer to retrieve the actual data. The tokenizer provides the speed, and the wrapper provides the flexibility.
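As a purely hypothetical sketch of what a tokenizer might hand back (the actual Turtle JSON token layout differs), the wrapper layer only needs token types and character positions to pull out the values it cares about:

%Illustrative only - not the actual Turtle JSON data layout
raw = '{"age": 34, "name": "Jim"}';

tokens = struct(...
    'type',  {'object','key','number','key','string'},...
    'start', { 1,       2,    9,       13,   21     },...
    'stop',  { 26,      6,    10,      18,   25     });

%The Matlab wrapper then slices the raw text to produce actual values, e.g.:
age = str2double(raw(tokens(3).start:tokens(3).stop));   %-> 34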

After a bit of searching, I found a JSON tokenizer called JSMN written in C (https://github.com/zserge/jsmn). I wrapped the code with a mex wrapper so that it was accessible from Matlab. Looking more closely at the code I saw one thing that I didn't like and wanted to change. Then I found another. Before I knew it I had thrown out the original code and had something completely different.

Optimizing a JSON parser

In the end I wrote a very fast JSON parser that I think still provides good flexibility. This flexibility comes from a two-step approach: 1) tokenizing the JSON string and 2) converting this information into Matlab data structures.

The rest of this post will discuss the approaches I used to make the code run fast. Although I would like to eventually quantify the importance of these design decisions, I have not yet made any effort to measure how much each one contributes to execution time.

Finally, before I get started, it is important to qualify what "fast" means here. When looking at JSON parsers, I saw a lot of references to "fast" JSON parsers. Everything was the fastest. Turtle JSON, my JSON parser, is a pretty fast JSON parser. I may have gotten carried away with some of the optimizations, but in the end it was an enjoyable learning opportunity.

Onto the details!

Step 1: State Machines and Jumps

It took me a while to realize that the task of parsing JSON can be represented as a state machine. If you see an opening object character '{', then after whitespace you expect either an attribute key - or more specifically the opening quote of the key '"' - or otherwise a closing object character '}'. This line of thinking simplified my mental processing of the parsing dramatically.

In practical terms, this means I started using GOTO statements in my C code. Normally GOTO statements are frowned upon, but in this case I think they worked well. The alternative approach is to populate a variable with the next instruction to process and to run an infinite loop with a switch on that next instruction. Put another way, you have a while(1) loop with a switch statement on some character or number representing the instruction to run next. This works, but it is definitely slower. The compiler should, in some cases, be able to optimize this overhead away, but I found that it didn't appear to (although I'm certainly not a compiler expert). This was one of the few things I think I timed at some point, and I seem to remember a roughly 15% speed increase from using GOTOs.

In some cases there are relatively few states that can follow a given state. For example, there are only three things to do after opening an object: 1) open a key; 2) close the object; or 3) throw an error. For arrays and keys there are a few more. Here I decided to index into an array of "labels" using the character being parsed. This eliminates delays caused by branching logic. No matter what the character, I always do two things: 1) index into the array using the character; and 2) GOTO the address stored at that index. Note that computed GOTOs are unfortunately only supported by certain compilers, so at this point I switched from Visual Studio to TDM-GCC. The setup code looks like this:

   const void *array_jump[256] = {
        [0 ... 33]  = &&S_ERROR_TOKEN_AFTER_COMMA_IN_ARRAY,
        [34]        = &&S_PARSE_STRING_IN_ARRAY,            // "
        [35 ... 44] = &&S_ERROR_TOKEN_AFTER_COMMA_IN_ARRAY,
        [45]        = &&S_PARSE_NUMBER_IN_ARRAY,            // -
        [46 ... 47] = &&S_ERROR_TOKEN_AFTER_COMMA_IN_ARRAY,
        [48 ... 57] = &&S_PARSE_NUMBER_IN_ARRAY,            // 0-9
        [58 ... 90] = &&S_ERROR_TOKEN_AFTER_COMMA_IN_ARRAY,
        [91]        = &&S_OPEN_ARRAY_IN_ARRAY,              // [
        [92 ... 101]  = &&S_ERROR_TOKEN_AFTER_COMMA_IN_ARRAY,
        [102]         = &&S_PARSE_FALSE_IN_ARRAY,           // false
        [103 ... 109] = &&S_ERROR_TOKEN_AFTER_COMMA_IN_ARRAY,
        [110]         = &&S_PARSE_NULL_IN_ARRAY,            // null
        [111 ... 115] = &&S_ERROR_TOKEN_AFTER_COMMA_IN_ARRAY,
        [116]         = &&S_PARSE_TRUE_IN_ARRAY,            // true
        [117 ... 122] = &&S_ERROR_TOKEN_AFTER_COMMA_IN_ARRAY,
        [123]         = &&S_OPEN_OBJECT_IN_ARRAY,           // {
        [124 ... 255] = &&S_ERROR_TOKEN_AFTER_COMMA_IN_ARRAY};

Here && is used to get the address of a label. The '...' is a nice (compiler-specific) feature for filling in ranges of array values. In my labels, the leading 'S' stands for "State". You'll note that most of the character options jump to an error state, indicating that those characters are invalid given the current processing state of the parser.

The code that uses this table looks like this:

#define DO_ARRAY_JUMP goto *array_jump[CURRENT_CHAR]

I did not time the impact of this decision and I think this is probably the one design decision that may have had little impact. In particular this "optimization" gets into memory and caching issues which are beyond my grasp. Timing would be ideal, although I really like the way the code looks.

Step 2: Array or Object?

Parsing a primitive value, such as a number or string, does not depend on whether that value is in an array or in an object. However, there is some bookkeeping that needs to be done depending on what is holding the primitive. As a simple example, consider the following invalid JSON:

[1.2345}

I am currently using hilite.me (http://hilite.me/) for syntax highlighting, as a cursory inspection of the page source will show you. Even it knows that the closing object character '}' is invalid here.

At some point I decided to keep track of whether I was parsing something in an array or in an object. This knowledge lets us know whether these closing tags are valid or invalid without checking some other temporary variable. In other words, if my state is "parsing a number in an array", then I don't need to do a check to tell me that the '}' character is invalid.

I was worried about doing this at first. Was I over-optimizing? In the end I think it made the code even cleaner and easier to follow. Here's an example of this type of code, for a string:

S_PARSE_STRING_IN_ARRAY:
    
    INCREMENT_PARENT_SIZE;
    PROCESS_STRING;
    PROCESS_END_OF_ARRAY_VALUE;

//====================
S_PARSE_STRING_IN_KEY:

    STORE_TAC_KEY_SIMPLE;
    PROCESS_STRING;
    PROCESS_END_OF_KEY_VALUE_SIMPLE;

Step 3: Parent updates only for complex values

This was actually one of the last changes I made. I was trying to improve my speed for a benchmark (https://github.com/kostya/benchmarks#json). The benchmark file uses structures/objects to store values. Here is a sample of the test file:

{
  "coordinates": [
    {
      "x": 0.5405492533441327,
      "y": 0.1606088575740785,
      "z": 0.28996148804190514,
      "name": "khfmzc 6328",
      "opts": {
        "1": [
          1,
          true
        ]
      }
    },
    {
      "x": 0.2032080968709824,
      "y": 0.46900080253088805,
      "z": 0.8568254531796844,
      "name": "buxtgk 4779",
      "opts": {
        "1": [
          1,
          true
        ]
      }
    },

When parsing objects or arrays, I used a concept of parsing depth. In other words, in this example there is an object at depth 1, which holds a "coordinates" attribute at depth 2, which holds an array at depth 3, which holds an object at depth 4, which holds an attribute "x" at depth 5. Once I've parsed the object at depth 4, I need to go back to the array at depth 3, so you constantly need to keep track of an object's parent (i.e. the parent of the object at depth 4 is the array at depth 3).

The key insight was that we didn't care about managing parent information if we had an object attribute whose value was primitive (i.e., string, number, or logical), since after we parsed the primitive value we'd be immediately back at the object level. If however an attribute held an object or an array, then we needed to keep track of these parent relationships a bit more closely. Implementing this distinction of simple versus complex object values meant I could avoid a lot of extra bookkeeping.

To allow ourselves not to care about these primitive values, we hold off on updating depth and parents until we start parsing the value that the key holds, rather than when parsing the key string. This places extra work on parsing values in "objects" (really in the key of an object), but it speeds things up nicely. This is the opening part of the depth code that is now completely avoided when parsing primitives (as values in objects):

#define INITIALIZE_PARENT_INFO(x) \
        ++current_depth; \
        if (current_depth > 200){\
            goto S_ERROR_DEPTH_EXCEEDED; \
        }\
        parent_types[current_depth] = x; \
        parent_indices[current_depth] = current_data_index; \
        parent_sizes[current_depth] = 0;

This isn't the fanciest optimization but I really like it. It is an example of optimizing code by removing unnecessary logic. Ideally the compiler could do this as well but my experience has been that it often does not.

Step 4: SIMD (Single Instruction, Multiple Data) - Parallelization via the Instruction Set

This idea came from RapidJSON which is frequently held up as one of the fastest JSON parsers out there. The idea is that there exists a set of special instructions in the processor that can be used to execute a single instruction simultaneously on multiple elements of a data array (thus the SIMD term, single instruction multiple data). One danger of using these instructions is that they are processor dependent and you can run into problems transferring compiled code between computers. I'm currently only using one SIMD instruction, _mm_cmpistri(), which is part of what's known as the SSE4.2 set of instructions which were made available in Intel processors in 2008 and AMD processors in 2011.

The use case suggested by RapidJSON is skipping whitespace. In particular, this is really handy if you need to skip whitespace in files with lots of indentation (i.e. pretty-printed JSON). The parallel nature of the code comes in comparing a set of characters from your file to a specific set of characters, in this case the whitespace characters (space, tab, newline, carriage return). The tricky part is that this instruction takes multiple cycles (although fewer than performing the same comparisons in standard code), and thus you want to try and call it only when you think you are going to skip a lot of whitespace.

This code in C is as follows:

//Hex of 9,     10,     13,     32
//      htab    \n      \r     space
#define INIT_LOCAL_WS_CHARS \
    const __m128i whitespace_characters = _mm_set1_epi32(0x090A0D20);

//We are trying to get to the next non-whitespace character as fast as possible
//Ideally, there are 0 or 1 whitespace characters to the next value
//
//With human-readable JSON code there may be many spaces for indentation
//e.g.    
//          {
//                   "key1":1,
//                   "key2":2,
// -- whitespace --  "key3":3, etc.
//
#define ADVANCE_TO_NON_WHITESPACE_CHAR  \
    /* Ideally, we want to quit early with a space, and then no-whitespace */ \
    if (*(++p) == ' '){ \
        ++p; \
    } \
    /* All whitespace are less than or equal to the space character (32) */ \
    if (*p <= ' '){ \
        chars_to_search_for_ws = _mm_loadu_si128((__m128i*)p); \
        ws_search_result = _mm_cmpistri(whitespace_characters, chars_to_search_for_ws, SIMD_SEARCH_MODE); \
        p += ws_search_result; \
        if (ws_search_result == 16) { \
            while (ws_search_result == 16){ \
                chars_to_search_for_ws = _mm_loadu_si128((__m128i*)p); \
                ws_search_result = _mm_cmpistri(whitespace_characters, chars_to_search_for_ws, SIMD_SEARCH_MODE); \
                p += ws_search_result; \
            } \
        } \
    } \

Note that we first try and match a space and then we also check if the next character could be a whitespace (the whitespace characters are all less than or equal to ' ') before calling the SIMD code.

I had a hard time understanding SIMD, and the "cmpistri" (compare implicit length strings return index) function in particular is very complex, with its functionality varying considerably based on the flags that are passed in. After some help on StackOverflow (http://stackoverflow.com/questions/37266851/trouble-with-result-from-mm-cmpestri-in-c) I was able to get things working.

I also decided to use the same function to parse numbers. The reason for doing this, i.e. for skipping over numbers, is explained in the next section.

Step 5: Parallelization via Multiple Cores

In the previous section I discussed parallelization on a single core using special processor instructions. In this section I expand the parallelization to multiple cores.

I spent a decent amount of time trying to think if I could parse the entire JSON file in parallel. Although I had concocted some ideas on how it might work, I decided against it for now.

However, I am able to parse parts of it in parallel. Specifically, I decided to search to the end of numbers and strings, and then to parse them properly later. This post-processing can easily be done in parallel. Since I was skipping over the numbers initially and parsing them later, I added similar SIMD code to advance to the end of the number by looking for digits (and other numeric characters).

I decided not to use SIMD for finding the end of a string because it can be slower for short strings. In general I wanted the parser to be fast for parsing relatively large numbers (6+ digits), but I have no such size expectations for strings. (Although, note to self, we might be better off using SIMD for strings and avoiding SIMD for keys). For strings I am currently just using a loop and looking for the end of the string.  I log the start and end indices of the string to go back later and parse it. The string "parsing" consists of handling any escape sequences as well as converting UTF-8 to UTF-16 (what Matlab uses for character arrays).

I decided to use OpenMP for the post-processing of numbers and strings. This made writing parallel code extremely easy. Below is the parallel code that parses numbers. Simply by adding the pragma statement, iterations of this loop get split amongst all available cores. On my 5 year old desktop, and on my 3 year old laptop, this is 4 cores, which should give a decent speedup for number (and string) parsing. I should mention that I'm assuming most machines have four cores (or more), and that it is faster to use this two-step approach (skip and log first, then process later) rather than doing the processing right away on a single core.


#pragma omp parallel
{
    int tid = omp_get_thread_num();
    int error_location = 0;
    int error_value = 0;

    #pragma omp for
    for (int i = 0; i < n_numbers; i++){
        //NaN values occupy an index space in numeric_p but have a null
        //value to indicate that they are NaN
        if (numeric_p[i]){
            string_to_double_v3(&numeric_p_double[i],numeric_p[i],i,&error_location,&error_value);
        }else{
            numeric_p_double[i] = MX_NAN;
        }
    }

    //Each thread logs its own errors; these are combined after the
    //parallel region (see below)
    *(error_locations + tid) = error_location;
    *(error_values + tid) = error_value;
}

My code throws errors when parsing, rather than passing an error value out to the caller. However, the Matlab documentation states that no Matlab-based calls, like mexErrMsgIdAndTxt() (which is used to throw errors), should be made inside parallel code. Instead, each thread keeps track of its own errors, and these are combined later.

Step 6: String End Searching and End of File Handling

If we are searching for the end of a string to post-process later, then we need to find a terminating double-quote character. However, we also need to be careful of escapes that escape a double-quote, as well as the end of the file. Since we are skipping over the string (i.e. not parsing it immediately, see Step 5) we are focused on finding a double-quote character '"'. Once this has been encountered we can backtrack relatively easily to determine if the character indicates the end of the string, or if it has been escaped to indicate that the double-quote character is a part of the string itself.

In addition we also need to verify that the file doesn't end prematurely, and that the double-quote character hasn't been escaped (indicating that we need to continue parsing the string).

When reading from files, it is trivial to pad the read buffer with extra characters that are helpful for parsing. The same is currently done when the JSON is passed in directly as a string, although unfortunately that requires reallocating memory. The question then is how many additional bytes to add, what to put in them, and why.

I found the following buffer to be useful:

[0 '\' '"' 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

By including a '"', we ensure that we never read past the end of the JSON stream. An alternative and slower approach is to examine every character for both '"' and 0.

Since we are not looking for escaped characters initially, every time we encounter '"' we need to verify that it is the end of the string, and not a part of the string. This means that we need to look at the previous character and ensure that we don't see a '\' character.

By placing the '\' character between a null character and our '"' character, we do double-duty by checking for our escape character and entertaining the possibility of having encountered our sentinel '"' character. The alternative and slower approach is to have [0 '"'], and then check for both 0 and '\' every time we encounter a '"' character.

Finally, by adding sufficient characters to our buffer we ensure that we never read past the end of the stream when using SIMD.


STRING_SEEK:    

    while (*p != '"'){
      ++p;    
    }
    
    //Back up to verify that we aren't escaped
    if (*(--p) == '\\'){
        //See documentation on the buffer we've added to the string
        if (*(--p) == 0){
            mexErrMsgIdAndTxt("turtle_json:unterminated_string", 
                    "JSON string is not terminated with a double-quote character");
        }
        //At this point, we either have a true end of the string, or we've
        //escaped the escape character
        //
        //for example:
        //1) "this is a test\"    => so we need to keep going
        //2) "testing\\"          => all done
        //
        //This of course could keep going ... \\\\"

I rewrote this section multiple times to try and keep it simple. At the end of the day the main message is as follows: by using the padding buffer we chose, we can avoid a lot of unnecessary checks on every character of a string, while still catching errors and properly handling escaped double-quote characters.

How to go Faster

There are currently three things slowing this parser down; addressing them could make it faster.
  1. Everything is parsed, even if you only want a part of the file (unlike, for example, the following parser: https://github.com/mleise/fast ; I've also seen the term SAX parser floating around). This isn't too tough for compiled code, but I think it would be basically impossible to use effectively with a mix of compiled and interpreted code.
  2. This parser is generic. One alternative is to use code that generates code for parsing a specific JSON layout (e.g., https://google.github.io/flatbuffers/).
  3. I'm working within Matlab's memory management, which slows things down considerably. In general, parsing things fast comes down to: 1) avoiding unnecessary logic; 2) parallelization; and 3) memory handling.

There are still a few things that I could process using OpenMP, notably determining array and object types (whether homogeneous or heterogeneous), but I'm ready to move on. At this point the parser is usable (i.e. it doesn't take multiple seconds to load a 75 MB file). On my 2016 1.1 GHz m3 MacBook that 75 MB file parses in about 0.7 seconds (versus 0.73 s for the mat file). On my i5-3570 3.4 GHz desktop processor (released in Q2 2012) it takes only 0.3 seconds, which is actually faster than the 0.55 s it takes to read in the same data as a mat file (using '-v6').

Concluding Thoughts