Today I had a sort of “Boo-yah!” moment when I solved a problem I’ve been working on since the summer. Here’s the back story:

We’ve been integrating video tracking into our behavior assays in the lab, and to bring this method into our research, I have been working with one of the math-bio undergrads on writing R code to import and analyze each video, pulling out variables for statistical analysis later. We’re currently tracking fish swimming behavior in two dimensions from the side, with the idea that we can add a third dimension with more cameras later on. We film the fish individually using a DSLR and use tracking software to extract the fish’s position at each frame in the video. The resulting data file contains a video frame number, an x position, and a y position. From that data, we can pull out extra variables for analysis. For example, velocity can be estimated a number of ways, one of which is simply the distance moved between frames (the square root of the change in x squared plus the change in y squared) divided by the time between those frames:

  v = \frac{1}{\Delta t} \sqrt{(\Delta x)^2 + (\Delta y)^2}
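
As a rough sketch, the velocity formula could be computed in R along these lines. The column names (frame, x, y), the toy positions, and the 30 frames-per-second rate are assumptions for illustration, not values from the actual tracking files.

```r
# Hypothetical tracking output: one row per video frame
track <- data.frame(
  frame = 1:6,
  x = c(0, 3, 6, 6, 6, 10),
  y = c(0, 4, 8, 8, 8, 8)
)

fps <- 30  # assumed frame rate of the camera

# Time elapsed and distance moved between consecutive frames
dt   <- diff(track$frame) / fps
dist <- sqrt(diff(track$x)^2 + diff(track$y)^2)

# The first frame has no previous position, so its velocity is NA
track$velocity <- c(NA, dist / dt)
track
```

With diff(), each velocity describes the step from the previous frame to the current one, which is why the first row stays NA.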

 

Once the velocities are calculated, we can analyze behaviors such as freezing, when a fish holds still and doesn’t move. Many fish will freeze when placed into a new environment such as the filming tank. Some freeze for a long time before resuming activity, others may not freeze at all. After analyzing the freezing behavior in many individuals, we can then run statistical tests to ask whether variation in freezing time is correlated with some other variable. But first, we have to define freezing. Because of the nature of tracking software, the tracking spot keeps moving ever so slightly even when the fish doesn’t. This produces some noise in our data, so the fish never actually reaches a velocity of 0. Therefore, we have to define a threshold where any velocity below that threshold is considered freezing and any velocity above it is considered active. Here’s what happens when we simply define freezing below a threshold:

In this graph of Velocity over Time, the red dots represent velocities under the freezing threshold and green represents activity. Unfortunately, during the bout of freezing, some of the data points read as active. There are also a few points in the active zone that read as freezing. These points are where a fish reaches the wall of the tank and has to turn around. For a few frames, the velocity drops just low enough to be called freezing. Now, I could simply raise the threshold and solve most of the problem during the freezing bout, but I’d also get more freezing points when the fish is simply turning. Note the dilemma: how to isolate these few frames that are marked out of place and correct them.
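
To make the problem concrete, here is a minimal sketch of the simple threshold approach on a toy vector. The velocities and the cutoff of 1.0 are invented for illustration, and the name freeze_threshold is just a placeholder.

```r
# Toy velocities (arbitrary units); all values are made up for illustration
velocity <- c(12.0, 9.5, 0.4, 0.2, 1.8, 0.3, 0.1, 7.2, 0.9, 8.8)

# Assumed cutoff: anything slower than this counts as freezing
freeze_threshold <- 1.0

# TRUE = freezing, FALSE = active
freezing <- velocity < freeze_threshold
data.frame(velocity, freezing)
```

In this toy data, the fifth frame (1.8) sits in the middle of a freezing bout but is labeled active, and the ninth frame (0.9) is a brief slowdown labeled freezing. Those are the two kinds of misplaced points described above.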

For months, I’ve been pondering how to do this. I simply want to find runs of consecutive frames marked True for freezing (or False for active) that are shorter than some minimum run length, say 10 or 15 frames. In other words, any run of True shorter than 15 frames should be turned into False, and vice versa. Finally, last week I stumbled upon the rle() function: Run Length Encoding. If you run rle() on a vector of categorical data (True, False), it will list the length of every run of identical values. So, for example, a data set that looks like this:

TRUE  TRUE  TRUE FALSE TRUE  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

will report the following with the rle() function:

Run Length Encoding
lengths: int [1:6] 3 1 4 2 5 1
values : logi [1:6] TRUE FALSE TRUE FALSE TRUE FALSE
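
To reproduce this, the same vector can be run through rle() as below. The variable names x and runs are placeholders.

```r
x <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE,
       FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

runs <- rle(x)
runs$lengths  # length of each run, in order
runs$values   # the value (TRUE or FALSE) that each run holds

# inverse.rle() rebuilds the original vector from the encoding
identical(inverse.rle(runs), x)
```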

So, great! Now I can identify runs of similar data. The only problem is, I need to identify which data points belong to each of those runs. After playing with it some more, I found that the output of rle() is simply a list with two elements, $lengths and $values. So, I make a new column in the data frame using the rep() function: rep(rle(x)$lengths, rle(x)$lengths). This tells R to repeat each run length as many times as that run is long. Bear with me here. If the lengths were c(1,2,3,4,5), the output would be:

1 2 2 3 3 3 4 4 4 4 5 5 5 5 5

The sum of rle(x)$lengths is the total number of original data points, so the new column has exactly one entry per data point and associates every data point with the length of the freezing (or active) run it belongs to. Now I can simply subset the data, find all of the False points where the run length is less than 15, and change them to True. Then I re-run rep(rle(x)$lengths, rle(x)$lengths) on the updated column, find all of the runs of True that are shorter than 15, and change them to False. And here’s the result:


Boo-yah! Problem solved!
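
For anyone who wants to try the two-pass cleanup, here is a minimal sketch of the idea. The function names flip_short_runs and smooth_freezing are hypothetical, the toy vector is made up, and the sketch assumes a logical vector with no missing values. It is not the exact code we run in the lab.

```r
# Flip every run of `from` that is shorter than `min_run` to the opposite value.
# Assumes x is a logical vector with no NAs.
flip_short_runs <- function(x, from, min_run) {
  # One entry per data point: the length of the run that point belongs to
  run_len <- rep(rle(x)$lengths, rle(x)$lengths)
  x[x == from & run_len < min_run] <- !from
  x
}

# Pass 1: short FALSE (active) runs become TRUE (freezing).
# Pass 2: run lengths are recomputed, then short TRUE runs become FALSE.
smooth_freezing <- function(freezing, min_run = 15) {
  freezing <- flip_short_runs(freezing, from = FALSE, min_run = min_run)
  flip_short_runs(freezing, from = TRUE, min_run = min_run)
}

# Toy example with a minimum run length of 3 frames
freezing <- c(rep(TRUE, 5), FALSE, rep(TRUE, 4),
              rep(FALSE, 6), TRUE, TRUE, rep(FALSE, 4))
cleaned <- smooth_freezing(freezing, min_run = 3)
rle(cleaned)
```

In the toy example, the single active frame inside the freezing bout is absorbed into it, and the two-frame freezing blip during swimming is relabeled as active, leaving one run of 10 freezing frames followed by 12 active frames. The order of the two passes matters, since the first pass changes the run lengths the second pass sees.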

Now I’m simply playing with different threshold values for both freezing and run length to get the most accurate freezing times. With that in place, I’ll be able to analyze not just freezing behavior but also swimming behaviors as a percentage of active swimming time, for example, time spent in different vertical zones. Of course, the number of variables that can be pulled out of three simple numbers, x, y, and t(ime), is nearly endless. But right now, I feel pretty good about solving this problem that’s been stumping us since last summer.