Problem Solving to Victory: RLE

Today I had a sort of “Boo-yah!” moment when I solved a problem I’ve been working on since the summer. Here’s the back story:

We’ve been integrating video tracking into our behavior assays in the lab, and to put the method to use in our research, I have been working with one of the math-bio undergrads on writing R code to import and analyze each video, pulling out variables for statistical analysis later. We’re currently tracking fish swimming behavior in two dimensions from the side, with the idea that we can add a third dimension with more cameras later on. We film the fish individually using a DSLR and use tracking software to extract the fish’s position at each frame in the video. The resulting data file contains a video frame number, an x position, and a y position. From that data, we can pull out extra variables for analysis. For example, velocity can be estimated a number of ways, one of which is simply the distance moved between frames (the square root of the change in x squared plus the change in y squared) divided by the time t between those frames:

  v = \frac{1}{t} \sqrt{(\Delta x)^2 + (\Delta y)^2}
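As a rough sketch of that calculation in R: the data frame, its column names, and the frame rate below are made up for illustration, not our actual tracking output.

```r
# Hypothetical tracking output: one row per video frame.
# Column names and the frame rate are assumptions for this example.
track <- data.frame(
  frame = 1:6,
  x = c(0, 3, 6, 6, 6, 10),
  y = c(0, 4, 8, 8, 8, 11)
)
fps <- 30        # assumed camera frame rate (frames per second)
dt  <- 1 / fps   # time between consecutive frames, in seconds

# diff() gives the frame-to-frame change in position. The first frame has
# no previous position, so its velocity is left as NA.
dx <- diff(track$x)
dy <- diff(track$y)
track$velocity <- c(NA, sqrt(dx^2 + dy^2) / dt)
```

Here the fish moves 5 units between some frames and 0 between others, so the velocities come out in position units per second.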

 

Once the velocities are calculated, we can analyze behaviors such as freezing, when a fish holds still and doesn’t move. Many fish will freeze when placed into a new environment such as the filming tank. Some freeze for a long time before resuming activity, others may not freeze at all. After analyzing the freezing behavior in many individuals, we can then run statistical tests to ask whether variation in freezing time is correlated with some other variable. But first, we have to define freezing. Because of the nature of tracking software, the tracked point keeps jittering ever so slightly even when the fish isn’t moving. This produces noise in our data, so the fish never actually reaches a velocity of 0. Therefore, we have to define a threshold where any velocity below that threshold is considered freezing and any velocity above it is considered active. Here’s what happens when we simply define freezing below a threshold:

In this graph of Velocity over Time, the red dots represent velocities under the freezing threshold and green represents activity. Unfortunately, during the bout of freezing, some of the data points read as active. There are also a few points in the active zone that read as freezing. These points are where a fish reaches the wall of the tank and has to turn around. For a few frames, the velocity drops just low enough to be called freezing. Now, I could simply raise the threshold and solve most of the problem during the freezing bout, but I’d also get more freezing points when the fish is simply turning. Hence the dilemma: how to isolate these few frames that are marked out of place and correct them.
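The simple thresholding described above can be sketched in a couple of lines. The velocities and the cutoff here are invented to show both kinds of mislabeled frames; they are not real data.

```r
# Hypothetical velocities (position units per second), not real tracking data.
velocity  <- c(12, 10, 1.5, 11, 13, 0.4, 0.2, 2.6, 0.3, 0.1)
threshold <- 2   # assumed cutoff; the real value has to be tuned

# TRUE = freezing, FALSE = active
freezing <- velocity < threshold
freezing
```

In this toy vector, the third frame is an active fish slowing briefly (as in a turn at the wall) that gets marked as freezing, and the eighth frame is tracking jitter inside a freezing bout that gets marked as active.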

For months, I’ve been pondering how to do this. I simply want to find a way to recognize streaks of data that mark True for freezing (or False) and are shorter than some minimum streak length, say 10 or 15 frames. In other words, any streak of True shorter than 15 frames should be turned into False, and vice versa. Finally, last week I stumbled upon the rle() function: Run Length Encoding. If you run rle() on a vector of categorical data (True, False), it will list the length of every run where the data are the same. So, for example, a data set that looks like this:

TRUE  TRUE  TRUE FALSE TRUE  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

will report the following with the rle() function:

Run Length Encoding
lengths: int [1:6] 3 1 4 2 5 1
values : logi [1:6] TRUE FALSE TRUE FALSE TRUE FALSE
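To try this yourself, here is a minimal sketch that rebuilds the example vector and encodes it. The variable names are arbitrary; rle() and inverse.rle() are both base R functions.

```r
x <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE,
       FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

runs <- rle(x)
print(runs)   # prints the "Run Length Encoding" summary

# inverse.rle() expands the runs back into the original vector,
# which confirms that no information was lost in the encoding.
roundtrip <- inverse.rle(runs)
identical(roundtrip, x)
```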

So, great! Now I can identify runs of similar data. The only problem is, I need to identify which data points belong to each of those runs. After playing with it some more, I found that the output of rle() is simply a list, with $lengths and $values. So, make a new column in the data frame using the rep() function: rep(rle(x)$lengths,rle(x)$lengths). This tells R that I want to repeat the vector of run lengths, with each run length repeated as many times as its own value. Bear with me here. If the lengths were c(1,2,3,4,5), the output would look like:

1 2 2 3 3 3 4 4 4 4 5 5 5 5 5

In fact, if you take the sum of rle(x)$lengths, you should get the total number of original data points. This new column of the data frame now associates every data point with the length of the run it belongs to with respect to freezing. Now I can simply subset the data, finding all of the False where the run length is less than 15, and change them to True. Then I re-run rep(rle(x)$lengths,rle(x)$lengths), find all of the runs of True that are shorter than 15, and change them to False. And here’s the result:


Boo-yah! Problem solved!
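For anyone who wants to try the same trick, here is a rough sketch of the two-pass cleanup. The function names and the toy example are made up for this post (it is not our actual analysis script), and it assumes the input is a logical vector where TRUE means freezing.

```r
# Label every data point with the length of the run it belongs to.
run_length_per_point <- function(x) {
  r <- rle(x)
  rep(r$lengths, r$lengths)
}

# Flip every run of `value` that is shorter than `min_run` frames.
# which() drops any NA comparisons so they are left untouched.
flip_short_runs <- function(x, value, min_run) {
  len <- run_length_per_point(x)
  x[which(x == value & len < min_run)] <- !value
  x
}

# Two passes, in the order described above: first fill short gaps of
# activity inside freezing bouts, then recompute the runs and remove
# short blips of freezing.
smooth_freezing <- function(freezing, min_run = 15) {
  freezing <- flip_short_runs(freezing, FALSE, min_run)
  flip_short_runs(freezing, TRUE, min_run)
}

# Toy example with a short minimum run so the effect is easy to see.
example  <- rep(c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE), c(5, 1, 5, 6, 2, 6))
smoothed <- smooth_freezing(example, min_run = 3)
rle(smoothed)$lengths   # 11 14
```

In the toy example, the single False frame inside the first freezing bout is filled in, and the two-frame True blip in the active stretch is removed, leaving one freezing bout of 11 frames followed by 14 active frames. Note that the order of the two passes matters, since each pass changes the runs the next one sees.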

Now I’m simply playing with different threshold values for both freezing and run length to get the most accurate freezing times. With this in place, I’ll be able to analyze not just freezing behaviors, but also swimming behaviors as a percentage of active swimming time, for example, time spent in different vertical zones. Of course, the number of variables to pull out of three simple numbers (x, y, and time) is endless. But right now, I feel pretty good about solving this problem that’s been stumping us since last summer.
