I'm looking at data from the Twitter API. One of the data points I'm interested in is the user's location (country). Very few tweets are actually geo-tagged, but users have the option of entering a "location" in their profile. This is a free-form field, so the formatting is very inconsistent (sometimes a country, sometimes a city, sometimes a US state, sometimes "in my parents' basement"). An excerpt is below.
Question: Does anyone have suggestions on how to parse this kind of data to get the COUNTRY for as many users as possible? I can probably cobble something together with regular expressions, but does anyone know of a library or API that does this efficiently? Any ideas would be appreciated!
|9||New York, NY|
|12||New York, New York|
|15||San Diego, CA|
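
To make the regular-expression idea concrete, here is a rough sketch of the kind of thing I could cobble together. The function name `guess_country` and the lookup tables are placeholders made up for illustration, and the tables are deliberately tiny. It only covers the "City, State" and "City, Country" patterns and returns `None` for everything else.

```python
import re
from typing import Optional

# Hypothetical, deliberately incomplete lookup tables (illustration only).
US_STATE_CODES = {"NY", "CA", "TX"}
US_STATE_NAMES = {"new york", "california", "texas"}
COUNTRY_NAMES = {"united states": "US", "usa": "US", "canada": "CA", "germany": "DE"}


def guess_country(location: str) -> Optional[str]:
    """Return a two-letter country code guess, or None if nothing matches."""
    # Split on commas and look at the last non-empty part, e.g. "NY" in "New York, NY".
    parts = [p.strip() for p in location.split(",") if p.strip()]
    if not parts:
        return None
    last = parts[-1]

    # Last part is a known country name.
    if last.lower() in COUNTRY_NAMES:
        return COUNTRY_NAMES[last.lower()]

    # "City, XX" where XX is a known two-letter US state code.
    if len(parts) >= 2 and re.fullmatch(r"[A-Z]{2}", last) and last in US_STATE_CODES:
        return "US"

    # Last part is a spelled-out US state name.
    if last.lower() in US_STATE_NAMES:
        return "US"

    return None


for loc in ["New York, NY", "New York, New York", "San Diego, CA", "in my parents' basement"]:
    print(loc, "->", guess_country(loc))
```

This already shows the weakness of the approach: "CA" is treated as California here, but it is also the country code for Canada, and every new pattern needs another rule.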