How can I tell whether an email address contains a first or last name?

I have a bunch of email addresses, and I want to know how many of those addresses contain the first name or the last name. Later, I want to know the users' gender based on their first name and their ethnicity based on their last name. Do you have any suggestions? Some example email addresses are shown below.

12345@gmail.com

linayu@gmail.com

Really appreciate any suggestions or help.

I am thinking of this as a two-part project. Part 1 is the string manipulation: remove the @ and everything to its right, then split the remaining characters at a delimiter such as a period or underscore (assuming such a delimiter exists). So "john.lewis@gmail.com" becomes "john" and "lewis". I will leave it to someone adept at string manipulation to advise on this.
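For what it's worth, here is a minimal sketch of Part 1. It is written in Python rather than R purely for illustration, and the function name `split_local_part` and the choice of delimiters (period, underscore, hyphen) are my own assumptions, not something taken from a package.

```python
import re

def split_local_part(email):
    """Return the pieces of the address before the @, split on '.', '_' or '-'."""
    # Keep only the text to the left of the first @
    local = email.split("@", 1)[0]
    # Split on runs of the assumed delimiters and drop empty pieces
    return [piece for piece in re.split(r"[._-]+", local) if piece]

print(split_local_part("john.lewis@gmail.com"))  # ['john', 'lewis']
print(split_local_part("linayu@gmail.com"))      # ['linayu']
```

An address with no delimiter, such as linayu@gmail.com, comes back as a single piece, so it would still need some other rule to separate first and last name.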

Part 2 is then determining gender and ethnicity. In the US, the Social Security Administration compiles a list of first names by gender. See the package gender. This gives a probability of gender for each first name; for example, Madison as a first name has a 98.5% probability of being female based on this database. Of course, if your data is not from the US, you would need some other database to identify gender.

The package rethnicity will predict ethnicity. I don't know the source of its data.

An interesting dilemma is that even after separating "john.lewis@gmail.com" into "john" and "lewis", we don't know which is the first name and which is the last. I don't know how to resolve this one.

Thanks for your reply and suggestions. The problem is that, as my first step, I need to tell whether each email address contains any information about the first name or last name at all. This is because some of my email addresses consist only of numbers or arbitrary letters, for example, 123445@gmail.com or ABCD@gmail.com. Do you have any suggestions?

Really appreciate your reply.

Hi. You can certainly subset to delete numbers, but I'm not sure your problem can be solved. My email address is fcas80@... It could be short for Frank Castillo or something like that, or it could have no relationship to my name.
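As a rough sketch of the "delete numbers" step (in Python, for illustration only): strip the digits from the part before the @ and keep only the addresses that still have something left. The helper name `strip_digits` and the address jsmith42@example.com are made up for this example.

```python
import re

def strip_digits(email):
    """Return the part of the address before the @ with all digits removed."""
    local = email.split("@", 1)[0]
    return re.sub(r"\d+", "", local)

# Hypothetical sample; jsmith42@example.com is invented for illustration
emails = ["123445@gmail.com", "jsmith42@example.com", "linayu@gmail.com"]

# Keep only addresses whose local part still contains something after removing digits
candidates = [e for e in emails if strip_digits(e)]
print(candidates)  # ['jsmith42@example.com', 'linayu@gmail.com']
```

This only removes the addresses that are certainly not names. What remains (such as "jsmith") may or may not relate to a real name, which is the part I doubt can be solved.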

To do the analysis, you should apply several filters.
For example, analyze the emails that contain two names in one group, and run a separate analysis for the other situations you encounter.
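One possible way to set up such filters, sketched in Python for illustration. The category names and the rules behind them are assumptions of mine; "two_names" only means "two alphabetic pieces separated by a delimiter", not that the pieces are confirmed names.

```python
import re
from collections import Counter

def classify(email):
    """Assign an address to a rough pattern group based on the text before the @."""
    local = email.split("@", 1)[0]
    if local.isdigit():
        return "digits_only"
    # Split on period, underscore, or hyphen and drop empty pieces
    parts = [p for p in re.split(r"[._-]+", local) if p]
    if len(parts) == 2 and all(p.isalpha() for p in parts):
        return "two_names"
    if len(parts) == 1 and parts[0].isalpha():
        return "single_word"
    return "other"

# Sample addresses from this thread plus one invented one (jsmith42@example.com)
emails = [
    "12345@gmail.com",
    "linayu@gmail.com",
    "john.lewis@gmail.com",
    "ABCD@gmail.com",
    "jsmith42@example.com",
]

counts = Counter(classify(e) for e in emails)
print(counts)
```

Each group can then be analyzed separately, for instance by checking the "two_names" pieces against a list of known first names.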

This may be useful to find patterns in the databases.
