Hi folks,
I'm looking at data from the Twitter API. One of the data points I am interested in is location (country). Very few tweets are actually geo-tagged, but users have the option of entering a "location" in their user profile. This is a free-form field, so the formatting is very inconsistent (sometimes a country, sometimes a city, sometimes a U.S. state, sometimes "in my parents' basement"). Excerpt below.
Question: Does anyone have suggestions on how to parse this type of data in order to get the COUNTRY of the user for the largest possible number of users? I can probably cobble something together with regular expressions, but does anyone know of a library or API that does this efficiently? Any ideas would be appreciated!
# | Location |
---|---|
1 | Asia Pacific |
2 | Australia |
3 | NA |
4 | NA |
5 | Europe |
6 | Austin, TX |
7 | Bali, Indonesia |
8 | NA |
9 | New York, NY |
10 | Königstetten |
11 | Austin, TX |
12 | New York, New York |
13 | London |
14 | Nairobi |
15 | San Diego, CA |
16 | Mexico |
17 | Bucharest |
18 | NA |
19 | Southern California |
20 | India |
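
For reference, the kind of thing I could cobble together myself looks roughly like the sketch below. It is only a rough illustration: the lookup tables (`COUNTRIES`, `US_STATES`, `CITIES`) and the function name `guess_country` are made up for this post and cover just a few rows from the excerpt. A real solution would need a full gazetteer, and single-city lookups are ambiguous (there is more than one "London"), so I'm assuming the most common reading.

```python
# Tiny hand-made lookup tables (hypothetical and nowhere near complete).
COUNTRIES = {
    "australia": "Australia",
    "indonesia": "Indonesia",
    "mexico": "Mexico",
    "india": "India",
}
US_STATES = {"tx", "ny", "ca", "texas", "new york", "california"}
# Assumption: each city name maps to its best-known country.
CITIES = {
    "london": "United Kingdom",
    "nairobi": "Kenya",
    "bucharest": "Romania",
}


def guess_country(location):
    """Return a best-guess country name, or None if nothing matches."""
    if not location or location.strip().upper() == "NA":
        return None
    # Split on commas and check the parts from right to left, since the
    # most general part ("TX", "Indonesia") usually comes last.
    parts = [p.strip().lower() for p in location.split(",") if p.strip()]
    for part in reversed(parts):
        if part in COUNTRIES:
            return COUNTRIES[part]
        if part in US_STATES:
            return "United States"
        if part in CITIES:
            return CITIES[part]
    return None


for loc in ["Austin, TX", "Bali, Indonesia", "London", "Southern California", "NA"]:
    print(loc, "->", guess_country(loc))
```

Exact matching like this handles "Austin, TX" and "Bali, Indonesia" but gives up on "Southern California", "Königstetten", and regions like "Asia Pacific", which is why I'm hoping a library or API already does this better.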