Hi folks,
I'm looking at data from the Twitter API. One of the data points I am interested in is location (country). Very few tweets are actually geo-tagged, but users have the option of entering a "location" in their user profile. This is a free-form field, so the formatting is very inconsistent (sometimes a country, sometimes a city, sometimes a US state, sometimes "in my parents' basement"). Excerpt below.
Question: Does anyone have suggestions on how to parse this type of data in order to get the COUNTRY of the user for the largest possible number of users? I can probably cobble something together with regular expressions, but does anyone know of a library or API that does this efficiently? Any ideas would be appreciated!
| # | Location |
|---|---|
| 1 | Asia Pacific |
| 2 | Australia |
| 3 | NA |
| 4 | NA |
| 5 | Europe |
| 6 | Austin, TX |
| 7 | Bali, Indonesia |
| 8 | NA |
| 9 | New York, NY |
| 10 | Königstetten |
| 11 | Austin, TX |
| 12 | New York, New York |
| 13 | London |
| 14 | Nairobi |
| 15 | San Diego, CA |
| 16 | Mexico |
| 17 | Bucharest |
| 18 | NA |
| 19 | Southern California |
| 20 | India |
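
For reference, the kind of regex-plus-lookup approach I could cobble together myself would look roughly like the sketch below (Python). The lookup tables are tiny hand-made placeholders, not a real gazetteer, and `guess_country` is just a name I made up for illustration. It splits the field on commas, checks the most general part first for a country name or US state, then falls back to a city lookup.

```python
import re

# Hypothetical, hand-made lookup tables for illustration only.
# A usable version would need a full gazetteer of countries, regions and cities.
COUNTRIES = {
    "australia": "Australia",
    "indonesia": "Indonesia",
    "mexico": "Mexico",
    "india": "India",
}
US_STATES = {"tx", "ny", "ca", "texas", "new york", "california"}
CITIES = {
    "austin": "United States",
    "san diego": "United States",
    "london": "United Kingdom",
    "nairobi": "Kenya",
    "bucharest": "Romania",
}


def guess_country(location):
    """Return a best-guess country name, or None if nothing matches."""
    if location is None:
        return None
    text = location.strip().lower()
    if not text or text == "na":
        return None

    # Split "City, Region" style strings into their parts.
    parts = [p.strip() for p in re.split(r"[,/|]", text) if p.strip()]

    # The last part is usually the most general (country or state).
    for part in reversed(parts):
        if part in COUNTRIES:
            return COUNTRIES[part]
        if part in US_STATES:
            return "United States"

    # Fall back to a city lookup on any part.
    for part in parts:
        if part in CITIES:
            return CITIES[part]
    return None


if __name__ == "__main__":
    for sample in ["Austin, TX", "Bali, Indonesia", "London", "Europe", "NA"]:
        print(repr(sample), "->", guess_country(sample))
```

This obviously breaks down quickly: "Southern California", "Königstetten" and "Asia Pacific" all come back empty, and a bare "London" is assumed to be the UK one. That is why I'm hoping there is a library or API that already handles this.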