Byte 4 v1

  • Description: Your final product will be a visualization of data about someone's transportation activities. A primary goal of this project is to address data quality issues, which will be judged ... [more]
  • Due date: TBD
  • Grading: More details to be provided.

Installing AWARE on your phone (Only for those with Android phones)

This section of the tutorial is OPTIONAL since we know that not everyone has an Android phone. If you do not have an Android phone that supports this APK, you will have to work with the default data mentioned in class instead of your own personal data. In that case, you can skip this section of the tutorial. 

Those of you with Android phones may install an "APK" (Android application) for sensing activity on your phone (which will be made available on Blackboard). This application is based on the basic client provided as part of the AWARE framework, but configured with variables specific to our class. To install it, email yourself the APK and then open the email attachment on your phone.

AWARE is open source, and you might find it interesting to expand the set of sensors and capabilities of this application as part of a final project. Many off-the-shelf sensors are included that require no programming, and more can be added with programming. However, since mobile application development is not a major focus of this class, our tutorial will not include further information about AWARE development.

Installing the APK

When you install the APK, it will automatically activate location tracking (using network and GPS), activity recognition (using Google Activity Recognition), mobile ESM, the web server, and the MQTT server. It will prompt you to install the Google Activity Recognition plugin from the AWARE server (you should do so). By default, it syncs data to the server every 5 minutes (reasonable enough; you can set the interval higher or lower in the AWARE Client web service settings). Data is only uploaded if you are connected to Wi-Fi when the upload is triggered (you can change this setting too). 

Increasing Availability

Once you install the application, it should start logging data to the server. However, Android kills "non-essential" services that are running in the background when there is an issue with memory, battery, and so on. If you want to ensure that your application is running at all times, you must activate the AWARE framework in the Accessibility services (in your phone settings). This is not required, but it will increase the reliability of your data. Note that on Samsung devices only, this will cause your phone to talk to you, due to an accessibility bug. 

Workaround for Samsung Bug

- Go to Settings -> Accessibility -> Talkback « ON »
- Select « Settings » from the bottom of the page
- Uncheck all Talkback options
- Switch Talkback « Off » again
- Go back to Settings -> Application manager -> All
- Disable Google TTS (text to speech)
- Disable Samsung TTS (text to speech)

If you’re experiencing the same issue on another phone, try this:

- Go to Settings -> App Manager -> Select GoogleTTS -> Disable
- Go to Settings -> App Manager -> Select SamsungTTS (on a Samsung phone) -> Disable

Testing your setup

To test if your phone is correctly working and hooked up, open up the web page http://[host]/aware/index.php/byte4/dashboard and send a message to your phone:
1 - Manually input your device MQTT ID (into the text entry labeled "MQTT ID:" at top right in the image below). Do not change the menu selection for "available clients".

2 - Write the topic to which you wish to send your message: to test the device connection to the server, use “esm” without the quotes.
3 - In the message field, copy and paste this:

[{'esm': { 'esm_type': 1, 'esm_title': 'ESM Freetext', 'esm_instructions': 'The user can answer an open ended question.', 'esm_submit': 'Next', 'esm_expiration_threashold': 60, 'esm_trigger': 'AWARE Tester' }}]

4 - Press “Send”

If everything goes well (i.e., your device is currently connected to the server and the server is up and running), you should see a question on your phone’s screen.

Finding your phone's ID

You will need your phone's ID to filter the database results so that you only get data from your phone. The ID is found as follows:

Open the AWARE Sensors icon from the mobile phone’s launcher and you should see your device ID listed as the AWARE Device ID: a UUID (Universally Unique ID) alphanumeric value that was assigned to your client when you installed the Byte4.apk package.

Accessing the data

Data from *all* phones that have the above APK installed (my phone, your phone if you have one, and the phones of any classmates of yours) will be uploaded to a single Google cloud database. This database is associated with my account; however, I can grant access to each of your applications. For this reason, you will each need to create your Google app and send me your application name (e.g., jmankoff-byte4). You won't be able to access your own data or the default data until you do (although you can always use the default data instead). 

Because I 'own' the data, and making it fully accessible to all of you from outside the Google application sandbox is complicated and introduces security concerns, you will have to upload your code to Google Appspot each time you want to test (and then load [yourname] to see the results). There will be no local option, unfortunately. The library we will be using to access the data is MySQLdb (documentation). You will need to (1) download it and install it inside your Google Appspot directory, (2) add it to 'app.yaml' (or copy it over from my github version of byte4):

libraries:
- name: MySQLdb
  version: "latest"

and (3) in '': 

import MySQLdb

Now we can access the database. The following code:
_INSTANCE_NAME = 'jmankoff-byte4:aware'
_DB = 'byte4'

class MainHandler(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        if (os.getenv('SERVER_SOFTWARE') and
             os.getenv('SERVER_SOFTWARE').startswith('Google App Engine/')):

            # connect to the database
            db = MySQLdb.connect(unix_socket='/cloudsql/' + _INSTANCE_NAME, db=_DB, user='root')
            cursor = db.cursor()

            # execute the query SHOW TABLES
            cursor.execute('SHOW TABLES')
            # fetch the results and display them
            for r in cursor.fetchall():
                self.response.write('%s\n' % str(r))
        else:
            self.response.write('Need to connect from Google Appspot')

will show the names of all the tables in the database:
('applications_crashes',)
('applications_foreground',)
('applications_history',)
('applications_notifications',)
('aware_device',)
('esms',)
('locations',)
('mqtt_messages',)
('mqtt_subscriptions',)
('plugin_google_activity_recognition',)
('plugin_mode_of_transportation',)

What are the key tables, and what are their contents?

Table: plugin_google_activity_recognition -- contains activities recognized by Google
  • _id: integer - primary key, auto incremented, just numbers each row
  • timestamp: double - unixtime in milliseconds of when we collected this data point
  • device_id: varchar - the unique device who sent this data.
  • activity_type: integer - a constant number assigned by Google for each activity_name. Values are:
    • in_vehicle: 0
    • on_bicycle: 1
    • on_foot: 2
    • still: 3
    • unknown: 4
    • tilting: 5
  • activity_name: varchar - a human-readable description of the physical activity. Possible values are: in_vehicle, on_bicycle, on_foot, still, unknown, tilting.
    • in_vehicle: the user is in a car/bus/train
    • on_bicycle: the user is biking
    • on_foot: the user is walking/running
    • still: the user is stationary somewhere
    • unknown: Google’s algorithm has no idea of what you are doing
    • tilting: the user has the phone in their hands or on a desk, and it is moving slightly.
  • confidence: integer - a value from 0 to 100 indicating the likelihood that the user is performing this activity. The larger the value, the more consistent the data used to perform the classification is with the detected activity.
  • activities: JSON Array with JSON objects with other potential activities the user might be doing. Each JSON object has two values: activity (which contains the activity_name) and confidence (as before). The sum of the confidences of all detected activities for a classification will be <= 100. This means that larger values such as a confidence of >= 75 indicate that it's very likely that the detected activity is correct, while a value of <= 50 indicates that there may be another activity that is just as or more likely.
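
Since the activities column stores a JSON array, you can unpack it in Python to inspect the alternatives. A minimal sketch (the sample value below is made up, in the format described above: objects with activity and confidence keys):

```python
import json

# hypothetical value for the 'activities' column (made-up data)
activities_json = ('[{"activity": "on_foot", "confidence": 60},'
                   ' {"activity": "on_bicycle", "confidence": 30}]')

alternatives = json.loads(activities_json)
# sort the alternative classifications, most confident first
alternatives.sort(key=lambda a: a["confidence"], reverse=True)

for alt in alternatives:
    print(alt["activity"], alt["confidence"])

# the confidences of all detected activities should sum to <= 100
assert sum(a["confidence"] for a in alternatives) <= 100
```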

Table: locations - the data from Google Fused Locations
  • _id: integer - primary key, auto incremented, just numbers each row
  • timestamp: double - unixtime in milliseconds of when we collected this data point
  • device_id: varchar - the unique device who sent this data
  • double_latitude: double - the latitude of coordinates
  • double_longitude: double - the longitude of coordinates
  • double_bearing: double - the direction (in degrees) in which the user is heading (used to know which way the user is moving).
  • double_speed: double - how fast the user is moving, in meters/second
  • double_altitude: double - the user's height above sea level, in meters
  • provider: varchar - how this location fix was established. One of: fused, network, gps.
  • accuracy: double - how accurate the location fix is, in meters.
  • label: varchar - a label for this location, if it exists. Some plugins can use it to label a specific location (e.g., home, work, etc).

Exploring the data

For exploring the data, you will want to always select for only data that matches the device ID you care about. For those of you doing this assignment who do not have your own android devices, my device id is '785434b8-ce03-46a2-b003-b7be938d5ff4'. This is also the default device ID used in the code. You can see all of the device IDs currently registered by using the query:

SELECT DISTINCT device_id FROM locations

The word DISTINCT in this query simply ensures that only unique values for the column device_id are returned.

There are a few queries you may find useful to use for exploring the data initially. For example, 

SELECT FROM_UNIXTIME(timestamp/1000,'%Y-%m-%d') AS day_with_data, COUNT(*) AS records FROM locations WHERE device_id = '670b950e-f821-4e40-bb6c-547e4499c9c5' GROUP by day_with_data; 

will show the number of days for which location data was collected. This query creates a string for each row (record) in the table 'location' which specifies the year, month and day that row was recorded. The rows are grouped based on this string, and then counted. In more detail:

  • We've already introduced SELECT in Byte 2. Normally, SELECT takes the names of columns (or *) as arguments, which specifies which columns of the table should be selected. 
  • The command FROM_UNIXTIME(...) is used instead of the column name. It takes as input a unix timestamp (number of seconds since January 1, 1970) and converts it to a format that will be useful in our query -- namely Year-Month-Day format. The 'timestamp' column of the locations table is stored in milliseconds since 1/1/1970, so we first divide by 1000. 
  • The command AS simply provides a name for what we just did (in this case the name 'day_with_data' for the formatted timestamps we are producing)
  • COUNT(*) will count the number of records in a table. However, because we end by saying GROUP BY, the count will be divided up into one count per group (the number of rows in each group). Groups are defined using 'day_with_data', which was the name we gave our timestamps. 
  • FROM name specifies which table this stuff will all be found in
  • WHERE device_id = '...' specifies that we should only look at records with that specific device_id (i.e. records recorded by my phone, or yours if you change the id)
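
To see what the query is doing, here is the same grouping logic written in plain Python (a sketch with made-up millisecond timestamps, not real data from the locations table; note that FROM_UNIXTIME uses the database server's timezone, while this sketch uses UTC):

```python
from collections import Counter
from datetime import datetime, timezone

# made-up unix timestamps in milliseconds, like the 'timestamp' column
timestamps_ms = [1387287000000, 1387287060000, 1387373400000]

# FROM_UNIXTIME(timestamp/1000, '%Y-%m-%d') in Python: divide by 1000,
# then format as Year-Month-Day
days = [datetime.fromtimestamp(t / 1000, tz=timezone.utc).strftime('%Y-%m-%d')
        for t in timestamps_ms]

# COUNT(*) ... GROUP BY day_with_data: one count per unique day
records_per_day = Counter(days)
for day, count in sorted(records_per_day.items()):
    print(day, count)
```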

The result, for my data, looked something like this at the end of January (of course, this will change over time).

#Days with data from location data
('2013-12-17', 415L)
('2013-12-18', 1216L)
('2013-12-19', 241L)
('2013-12-20', 81L)
('2013-12-21', 820L)
('2014-01-08', 371L)
('2014-01-09', 1110L)

It is also possible to trace the duration of events. Because of the complexity of these queries, I'm going to show the Python code used to construct them. This makes the queries clearer, because I can use variables to represent the unix time formatting and other key pieces. The key thing to realize about this code is that we are simply putting together a string that will then be used as a query. The variables _ID and _ACTIVITY are global variables defined at the top of ''. _ID holds the device id for my phone; _ACTIVITY is the name of the table in which activity information is stored.

# turns a unix timestamp into Year-month-day format
day = "FROM_UNIXTIME(timestamp/1000,'%Y-%m-%d')"
# turns a unix timestamp into Hour:minute format
time_of_day = "FROM_UNIXTIME(timestamp/1000,'%H:%i')"
# calculates the difference between two timestamps in seconds
elapsed_seconds = "(max(timestamp)-min(timestamp))/1000"
# the name of the table our query should run on
table = _ACTIVITY
# turns a unix timestamp into Year-month-day Hour:minute format
day_and_time_of_day = "FROM_UNIXTIME(timestamp/1000, '%Y-%m-%d %H:%i')"
# Groups the rows of a table by day and activity (so there will be one 
# group of rows for each activity that occurred each day.  
# For each group of rows, the day, time of day, activity name, and 
# elapsed seconds (difference between maximum and minimum) is calculated, 
query = "SELECT {0} AS day, {1} AS time_of_day, activity_name, {2} AS time_elapsed_seconds FROM {3} WHERE device_id='{4}'  GROUP BY day, activity_name, {5}".format(day, time_of_day, elapsed_seconds, table, _ID, day_and_time_of_day)

Running this query will show, in order, each activity that occurred during each day that data was recorded, along with the duration of that activity.  
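
The elapsed-seconds calculation is easy to mirror in plain Python: group the rows, then subtract the minimum timestamp from the maximum within each group. The rows below are made up for illustration, not real data from the activity table:

```python
from collections import defaultdict

# hypothetical (timestamp_ms, activity_name) rows
rows = [
    (1387287000000, 'on_foot'),
    (1387287300000, 'on_foot'),
    (1387287600000, 'still'),
    (1387290600000, 'still'),
]

# group timestamps by activity name
groups = defaultdict(list)
for ts, name in rows:
    groups[name].append(ts)

# (max(timestamp) - min(timestamp)) / 1000 for each group, in seconds
for name in sorted(groups):
    stamps = groups[name]
    elapsed = (max(stamps) - min(stamps)) / 1000.0
    print(name, elapsed)
```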

Debugging Appspot Code Online

Because we are working only with uploaded versions of this code, the logging information is no longer available in the Google Appspot log on your local machine. Instead, you will need to go to your application's console on Google Appspot to view your logs. If you click on the appropriate application link and then click on 'Logs', you will be able to see something like this: 

Clicking on the + sign next to the top log will show you the details of the most recent attempt to run your application (what would have been the log on your local machine). Output from logging will print here, and errors will show up here as well. Of course, if and when you use any javascript to display information, you may also still need to use your browser developer tools to look at the javascript console.
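
As a reminder, anything you send through Python's standard logging module will appear in that log. A minimal sketch:

```python
import logging

# messages at INFO level and above will appear in the Appspot log viewer
logging.getLogger().setLevel(logging.INFO)
logging.info("fetched %d rows for device %s", 42, "my-device-id")

try:
    int("not a number")
except ValueError:
    # logging.exception records the message plus the full stack trace
    logging.exception("could not parse value")
```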

Identifying Common Locations

One of the first things we will want to do with mobile data is identify common locations. This is most easily done by combining work in SQL with work in Python. In particular, we will first run a query:

query = "SELECT double_latitude, double_longitude FROM {0} WHERE device_id = '{1}'".format(_LOCATIONS, _ID)

To make life easier, I have written a helper function in '' that runs a query. The function also checks whether the query failed, and logs as much information as possible using logging on failure. It returns the query results in a list, and is called make_query (you can find it in the source code on github). Once we have the locations in a list, we need to identify unique locations. For this, I have again written a helper function, called bin_locations, which takes a list of locations as input and compares them pairwise. The algorithm is fairly simple: store the first location in a dictionary under the bin name 1. For each new location, check its distance from each stored location already in the dictionary. If it is within epsilon of any of the stored locations, ignore it (it is not 'new'). If it is further than epsilon from all stored locations, add it to the dictionary under a new bin name -- it is a new location. Distance is calculated by the helper function distance_on_unit_sphere(..), which is based on John Cook's implementation and explanation. 

def bin_locations(self, locations, epsilon):
    # always add the first location to the bin
    bins = {1: (locations[0][0], locations[0][1])}
    # this gives us the current maximum key used in our dictionary
    num_places = 1

    # now loop through all the locations 
    for location in locations:
        lat = location[0]
        lon = location[1]
        # assume that our current location is new for now 
        # (hasn't been found yet)
        place_found = False
        # loop through the bins 
        for place in bins.values():
            # check whether the distance is smaller than epsilon
            if self.distance_on_unit_sphere(lat, lon, 
                                            place[0], place[1]) < epsilon:
                #(lat, lon) is near (place[0], place[1]), 
                # so we can stop looping
                place_found = True

        # we weren't near any of the places already in bins
        if place_found is False:
  "new place: {0}, {1}".format(lat, lon))
            # increment the number of places found and create a 
            # new entry in the dictionary for this place. Store the
            # lat lon for comparison in the next round of the loop
            num_places = num_places + 1
            bins[num_places] = (lat, lon)

    return bins.values()
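
The distance helper itself is not shown above. Here is a sketch following John Cook's great-circle formula (the clamping of the cosine is my addition, to guard against floating-point rounding pushing it slightly outside [-1, 1]):

```python
import math

def distance_on_unit_sphere(lat1, long1, lat2, long2):
    # convert latitude and longitude to spherical coordinates in radians
    degrees_to_radians = math.pi / 180.0
    # phi = 90 - latitude
    phi1 = (90.0 - lat1) * degrees_to_radians
    phi2 = (90.0 - lat2) * degrees_to_radians
    # theta = longitude
    theta1 = long1 * degrees_to_radians
    theta2 = long2 * degrees_to_radians
    # spherical law of cosines
    cos = (math.sin(phi1) * math.sin(phi2) * math.cos(theta1 - theta2) +
           math.cos(phi1) * math.cos(phi2))
    # guard against rounding errors before taking the arc cosine
    cos = min(1.0, max(-1.0, cos))
    # arc length on a unit sphere; multiply by the Earth's radius
    # (~6371 km) to get a distance in kilometers
    return math.acos(cos)
```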

The results when I run this on my data are: 

 latitude    longitude    hand-calculated address                   my explanation
 40.4520251  -79.9434955  436 Morewood Avenue, Pittsburgh, PA       Near CMU / my home
 40.4355582  -79.8637900  400-503 Sherwood Road, Wilkinsburg, PA    On way to Hidden Valley
 40.4429779  -79.7819721  Penn Lincoln Parkway, Monroeville, PA     On way to Hidden Valley
 40.3551721  -79.6836222  Pennsylvania Turnpike, Irwin, PA          On way to Hidden Valley
 40.2643373  -79.6554965  1937 Main Street, New Stanton             On way to Hidden Valley
 ...         ...          ...                                       several more of these

This summary immediately highlights several issues. The first is required to solve for this assignment. The remainder are optional (although would be very important in working further with this data than this byte takes you). 
  • First, the granularity here is too coarse. In particular, everything happening within Pittsburgh is grouped under a single address (436 Morewood Avenue). This means that epsilon is probably not correct and needs to be tweaked. 
  • Second, we cannot differentiate between places where I spend a lot of time and places where I am just traveling through based on the data given. I have looked up the associated address by hand and interpreted it for you (on the right), but nothing automatic is taking place. A very simple way to fix this would be to keep track of the number of matches for each location. Places where I spend a lot of time should have a lot more values logged. This would be relatively easy to add to the information stored in each bin. 
  • The third problem is that the 'label' for a location is whatever shows up first in the log in the general area of that popular spot. This is not as big an issue as the other two, but could be fixed by keeping a count for all of the lat/lon pairs within epsilon of the first location found, and then using the most popular one as the label. 
  • Lastly, investigating this data will be hard if we are always calculating addresses by hand (since lat/lon is essentially meaningless to the reader). It might make your life easier if you could do this automatically, and the google geocoding api is one way to do so.
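
The counting idea from the second bullet is straightforward to sketch. Everything below is illustrative (the function names and the toy euclidean distance are mine, not the tutorial's helpers); each bin keeps a visit count alongside its coordinates:

```python
def bin_locations_with_counts(locations, epsilon, distance):
    # bin id -> [lat, lon, count of location fixes near this place]
    bins = {}
    for lat, lon in locations:
        for b in bins.values():
            if distance(lat, lon, b[0], b[1]) < epsilon:
                b[2] += 1  # another fix near an existing place
                break
        else:
            # no existing bin was close enough: start a new one
            bins[len(bins) + 1] = [lat, lon, 1]
    return bins

# toy distance in raw degrees, just for illustration
# (use the great-circle helper for real data)
def degree_distance(lat1, lon1, lat2, lon2):
    return ((lat1 - lat2) ** 2 + (lon1 - lon2) ** 2) ** 0.5

fixes = [(40.45, -79.94), (40.4501, -79.9401), (40.26, -79.65)]
bins = bin_locations_with_counts(fixes, 0.01, degree_distance)
# bins with a high count are places where a lot of time was spent
```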

Associating Activities with Locations

At this point, after working with my code and adding to it, you should have a reasonably descriptive list of locations (i.e., a reasonable epsilon) and be able to divide them into locations, possibly grouped into those that are visited a lot (common locations) and those that are visited only occasionally. Our next step is to explore what activities take place in and around these locations. To do this, we first need to make a query that will get locations and activities, both organized by day and hour. We can also collect the elapsed time for each activity:

time_of_day = "FROM_UNIXTIME(timestamp/1000,'%H:%i')"
day = "FROM_UNIXTIME(timestamp/1000,'%Y-%m-%d')"
query = "SELECT {0} as day, {1} as time_of_day, double_latitude, double_longitude FROM {2} WHERE device_id = '{3}' GROUP BY day, time_of_day".format(day, time_of_day, _LOCATIONS, _ID)
locations = self.make_query(cursor, query)

day_and_time_of_day = "FROM_UNIXTIME(timestamp/1000, '%Y-%m-%d %H')"
elapsed_seconds = "(max(timestamp)-min(timestamp))/1000"
query = "SELECT {0} as day, {1} as time_of_day, activity_name, {2} as time_elapsed_seconds FROM  {3} WHERE device_id = '{4}' GROUP BY day, activity_name, {5}".format(day, time_of_day, elapsed_seconds, _ACTIVITY, _ID, day_and_time_of_day)
activities = self.make_query(cursor, query)

# now we want to associate activities with locations. This will update the
# bins list with activities.
self.group_activities_by_location(bins, locations, activities, _EPSILON)

Once we have locations, activities, and bins, we can look up the location for each activity (using the day and hour that the activity occurred at as an index into the list of locations) and then look up the bin for the activity (using its location):
def group_activities_by_location(self, bins, locations, activities, epsilon):
        day = 0
        hour = 1
        lat = 2
        lon = 3
        # a place to store activities for which we couldn't find a location
        # (indicates an error in either our data or algorithm)
        no_loc = []
        for activity in activities:
            # collect the information we will need 
            aday = activity[day]
            ahour = activity[hour]
            aname = activity[2]
            aduration = activity[3]
            # assume for now that no location matches this activity
            place_found = False
            # loop through the locations
            for i in range(len(locations)):
                # if we found a match
                if ((locations[i][day] == aday) and 
                    (locations[i][hour] == ahour)):
                    # we find the correct bin
                    bin = self.find_bin(bins, locations[i][lat],
                                        locations[i][lon], epsilon)
                    # and add the information to it
                    bins[bin] = bins[bin] + [aname, aduration]
                    place_found = True
            # otherwise record the error in no_loc
            if not place_found:
                no_loc.append([aname, aduration])
        # add no_loc to the bins
        bins.append(no_loc)

The result looks something like this. Each item has a lat, a lon, and then a series of activities and durations. Note that some locations don't seem to have an activity, and the very last set of things are activities for which we could not find a location that matched. Determining the cause of this is left as an exercise for the curious (we are no longer in the 'quality checking' phase of the semester, though that is certainly a legitimate and necessary thing to do with mobile code).

[40.4520251, -79.943495499999997, 'in_vehicle', 39392.071000000004, 'on_foot', 39727.334000000003, 'still', 70414.203999999998, 'tilting', 70699.740000000005, 'unknown', 68273.095000000001, 'in_vehicle', 81884.464000000007, 'on_foot', 40043.336000000003, 'still', 84955.171000000002, 'tilting', 79536.949999999997, 'unknown', 80886.019, 'still', 86292.365999999995, 'in_vehicle', 26267.181, 'on_foot', 69574.019, 'still', 86338.226999999999, 'tilting', 72176.975999999995, 'unknown', 69511.421000000002, 'in_vehicle', 77076.141000000003, 'on_bicycle', 47379.989999999998, 'on_foot', 76726.036999999997, 'still', 80446.138999999996, 'tilting', 77615.462, 'unknown', 77532.361000000004]
[40.435558200000003, -79.863789999999995]
[40.442977900000002, -79.781972100000004]
[40.355172099999997, -79.683622200000002]
[40.264337300000001, -79.655496499999998, 'in_vehicle', 61206.953000000001, 'on_foot', 59838.290000000001, 'still', 85463.236000000004, 'tilting', 79628.938999999998, 'unknown', 80889.201000000001]
[40.159720700000001, -79.479924699999998]
[40.128560100000001, -79.404786000000001, 'on_foot', 253.934]
[40.042347599999999, -79.228534199999999, 'on_bicycle', 8648.982, 'in_vehicle', 5743.6329999999998, 'still', 86355.557000000001, 'tilting', 82030.725000000006, 'unknown', 81165.418000000005]
[40.121983999999998, -79.303508800000003]
[40.210303699999997, -79.579402999999999]
[40.409026599999997, -79.719704500000006]
[['in_vehicle', 4357.4719999999998], ['on_foot', 10664.239], ['still', 28324.585999999999], ['tilting', 26078.298999999999], ['unknown', 28031.917000000001], ['still', 38368.474999999999], ['tilting', 0.0], ['unknown', 0.0]]

Hand In Expectations

The tutorial above took you through the point where you could display (in very ugly form) the relationship between activities and locations. Your assignment is to explore something similar -- the relationship between activities and time of day. 
  • You will be asked to explain in general terms what you learned about the relationship between time and activity
  • You will be asked which user's data you used (your own or the instructor's)
  • You will be asked for a working version of your code that displays the results. You have the option to display the results in HTML (please make them nicer than my example though) or to make some sort of interactive chart in D3 (interaction is not a requirement of this assignment).
  • You will also be asked what epsilon you picked for binning locations and why (even though this is not the main focus of the code you will write)