Accelerating Accelerometer Analysis
With the driving scores project, we have begun the first round of collecting data samples of different driving events. In order to create a program that identifies and assesses driving events, we first need a large dataset of examples to teach our classifier (an algorithm or statistical model) to recognize different events such as turns or texting on the phone. Once we have a method of classifying events, we can calculate a driving score by penalizing phone usage or rewarding smooth turns.
To gather event samples, two of my co-workers and I drove to the U of A, and while doing so completed a list of actions such as pretending to text, take a call, or check our phones. We recorded accelerometer and gyroscope data during the whole trip using an app called SensorLog, so that we would have data on what these actions look like inside a moving car.
Analyzing Our Data
From our drive, we collected six sets of data: gyroscope X, gyroscope Y, gyroscope Z, accelerometer X, accelerometer Y, and accelerometer Z. Most of the data was highly obscured by the motions of the car. On the bumpy, pothole-ridden roads of Tucson, a car is constantly bouncing up and down even when it is just driving straight.
Gross.
In order to analyze the data, I have been using a statistical programming language called R. Because R is mostly used by people who need to perform statistical analysis, rather than software developers, it is a relatively easy language to learn and has libraries specifically tailored for data analysis. As a first step, I wanted to write a program that would identify the driving events for me - i.e. to identify when I picked up my phone, or when the car turned, based on the data I collected from my phone. To do this, I decided to focus on the gyroscope Z data. Because car motions mostly cause the phone to move around on the horizontal plane, the phone does not sense any rotation around the z-axis unless the car is turning or the phone is being picked up. Thus, the data from the z-axis gyroscope has very little noise, as can be seen below:
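For context, loading the log and drawing a plot like this in R might look something like the following sketch. The file name "drive_log.csv" and the column name "gyroZ" are placeholders, since the exact headers depend on how SensorLog exports its CSV.

# Read the logged sensor data (file and column names are placeholders).
drive <- read.csv("drive_log.csv")

# Plot the z-axis gyroscope readings as a time series.
plot(drive$gyroZ, type = "l",
     xlab = "Sample index", ylab = "Gyroscope Z (rad/s)",
     main = "Z-axis rotation during the drive")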
With so little noise, we can see each driving event extremely clearly in the time series graph. The relatively flat areas are times when the car was simply driving. The tall spikes are times when I picked up my phone to text or make a call. Right in between the 0 and 5000 points, there are two times when I picked up my phone, held it up for a while (resulting in more shaking), and then put it back down. At the 5000 mark, there is a time when I quickly picked up my phone and put it back down. The small dips, such as right after 5000 and before 1000, are turns. Because there is so much data, it is still quite hard to examine the events. So, I wrote a program in R which identifies areas where something is happening and zooms into them to paint a clearer picture of what the data points look like.
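I won't reproduce the whole program here, but a minimal sketch of the idea looks like this: flag samples whose z-rotation magnitude exceeds a threshold, group nearby flagged samples into separate events, and plot a padded window around each one. The threshold, gap, and padding values are arbitrary choices for illustration, and the last two lines assume the "drive" data frame from the earlier sketch.

# Flag samples where |gyro z| exceeds a threshold, and group nearby
# flagged samples into events separated by long runs of quiet samples.
find_events <- function(z, threshold = 0.3, gap = 200) {
  active <- which(abs(z) > threshold)
  if (length(active) == 0) return(list())
  breaks <- c(0, which(diff(active) > gap), length(active))
  lapply(seq_len(length(breaks) - 1), function(i) {
    idx <- active[(breaks[i] + 1):breaks[i + 1]]
    c(start = min(idx), end = max(idx))
  })
}

# Zoom into one event by plotting a padded window around it.
zoom_plot <- function(z, event, pad = 100) {
  window <- max(1, event["start"] - pad):min(length(z), event["end"] + pad)
  plot(window, z[window], type = "l",
       xlab = "Sample index", ylab = "Gyroscope Z (rad/s)")
}

events <- find_events(drive$gyroZ)
if (length(events) > 0) zoom_plot(drive$gyroZ, events[[1]])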
The first time I picked up my phone, right in between the first and 5000th point.
The first bump in the data, most likely produced when the car made a turn.
While simple events such as texting versus turning can easily be distinguished with only gyroscope data, more complex events, such as texting while making a sharp turn versus simply turning, require more information to differentiate. In addition, even when driving events can be recognized, they cannot be assessed based on turn radius or acceleration from the gyroscope alone. Later on in the project, when we will be scoring driving based on accelerometer and gyroscope data, we will need more information about speed and acceleration. So, we still need to take the accelerometer data into account.
How Do We Understand Noisy Data?
When analyzing a time series, the goal is to identify trends or patterns in the data. However, there are often random fluctuations that obscure these patterns and trends. Instead of smooth straight lines, we have constant fluctuations.
As a result, the challenge with time series data is often figuring out how to distinguish the noise from the actual trends we are looking for. From a computer science perspective, this makes the problem of analyzing the data much more complex, because we don't get to look for exact patterns unless we somehow transform the noisy data.
There are many ways to transform time series data, such as with Fast Fourier Transforms, but the most common way is called segmentation. Segmenting time series data means approximating the data with a series of straight lines. Segmentation is especially important in identifying driving events, because the events need to be accurately parsed out of the data, and we need to simplify the data as much as we can to efficiently classify the event types. Because time series are almost always extremely messy, there has been a large amount of research into developing segmentation algorithms to apply to noisy data. The three main algorithms used are the Sliding Window Algorithm, the Top-Down Algorithm, and the Bottom-Up Algorithm.
1. Sliding Window - Start at the beginning of the time series, and keep adding points to the current line segment until some error bound is exceeded, then start a new segment (a minimal sketch of this idea appears after this list).
2. Top-Down - The time series is recursively divided ("divide and conquer"): every possible place to split the data into sections is tried, each section is approximated with one straight line, and the split with the least total error is selected. The process repeats on each piece until every section fits within the error bound.
3. Bottom-Up - The time series is divided into a large number of tiny segments, and at each step the pair of adjacent segments that can be merged into a single line with the least error is combined, as long as the result stays under the specified error bound (roughly the reverse of Top-Down).
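As promised above, here is a minimal sketch of the Sliding Window idea in R: grow the current segment one point at a time, and close it as soon as a straight-line fit to it exceeds the error bound. The error bound of 0.5 is an arbitrary illustrative value.

# Sliding Window sketch: grow the current segment until the straight-line
# fit error exceeds max_error, then close it and start a new segment.
sliding_window_segment <- function(y, max_error = 0.5) {
  breaks <- c()
  anchor <- 1
  for (i in 2:length(y)) {
    seg <- y[anchor:i]
    x <- seq_along(seg)
    err <- sum(resid(lm(seg ~ x))^2)   # sum of squared residuals
    if (err > max_error) {
      breaks <- c(breaks, i - 1)       # close the segment at the previous point
      anchor <- i
    }
  }
  c(breaks, length(y))                 # segment end points
}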
Each of these algorithms is optimal for different situations. For example, the Sliding Window Algorithm is better for segmenting noisier data, or data which needs to be segmented even while more data is being inputted. I am most interested in using the Bottom-Up Algorithm, because it runs in roughly O(n) time (meaning the time it takes for the algorithm to run grows linearly with the size of the input) and performs best when more subtle events need to be identified. Throughout next week I hope to read more about time series data in order to find the best method to segment our data, and try applying various segmentation algorithms to better understand and simplify the accelerometer data.
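Since the Bottom-Up Algorithm is the one I want to try first, here is a rough, unoptimized sketch of it in R: start from many tiny segments and repeatedly merge the adjacent pair whose combined straight-line fit has the smallest error, stopping when no merge stays under the error bound. The initial segment size and error bound are arbitrary, and real implementations update merge costs incrementally instead of recomputing them on every pass.

# Sum of squared residuals of a straight-line fit to y.
fit_error <- function(y) {
  x <- seq_along(y)
  sum(resid(lm(y ~ x))^2)
}

# Bottom-Up sketch: merge the cheapest adjacent pair of segments until
# no merge can be made without exceeding max_error.
bottom_up_segment <- function(y, init_size = 10, max_error = 0.5) {
  n <- length(y)
  starts <- seq(1, n, by = init_size)
  segs <- data.frame(start = starts, end = c(starts[-1] - 1, n))
  while (nrow(segs) > 1) {
    # Cost of merging each adjacent pair of segments.
    costs <- sapply(seq_len(nrow(segs) - 1), function(i) {
      fit_error(y[segs$start[i]:segs$end[i + 1]])
    })
    best <- which.min(costs)
    if (costs[best] > max_error) break   # no merge stays under the bound
    segs$end[best] <- segs$end[best + 1] # merge the cheapest pair
    segs <- segs[-(best + 1), ]
  }
  segs   # data frame of (start, end) indices for each segment
}

Running something like bottom_up_segment(drive$gyroZ) on the data frame from the earlier sketch would then return the straight-line segments approximating the z-gyroscope trace.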
Citations
1. Keogh, Eamonn, et al. "Segmenting time series: A survey and novel approach." Data Mining in Time Series Databases 57 (2004): 1-22.
2. Lovrić, Miodrag, Marina Milanović, and Milan Stamenković. "Algorithmic Methods for Segmentation of Time Series: An Overview."