R converting into more friendly names


#1

I have a list of hostnames that I would like to convert to more friendly names in R. Is this possible, please?

Host name
95b4ae6d890e4c46986d91d7ac4bf08200000W
95b4ae6d890e4c46986d91d7ac4bf08200000W
95b4ae6d890e4c46986d91d7ac4bf08200000V
95b4ae6d890e4c46986d91d7ac4bf08200000V
95b4ae6d890e4c46986d91d7ac4bf08200000Z
95b4ae6d890e4c46986d91d7ac4bf08200000Z
95b4ae6d890e4c46986d91d7ac4bf082000011
95b4ae6d890e4c46986d91d7ac4bf082000011
95b4ae6d890e4c46986d91d7ac4bf082000011
95b4ae6d890e4c46986d91d7ac4bf082000011
95b4ae6d890e4c46986d91d7ac4bf08200000H
95b4ae6d890e4c46986d91d7ac4bf08200000H


#2

You could do this in all sorts of ways. What did you have in mind?

You could map each of these to a number. Or you could map each to the name of a former President of the US. Or you could make each of them a noble gas.


#3

I was hoping for host1,host2,host3, and so on. Just to make it more readable.


#4

How is this stored? A list, a vector, a column of a table?
In a nutshell, my idea would be to generate a vector of friendly names, and then cbind it to the table, or pass it into a list.

E.g.

paste0("host", seq_len(10))

gives you this:

[1] "host1"  "host2"  "host3"  "host4"  "host5"  "host6"  "host7"  "host8"  "host9"  "host10"

Only instead of 10 you'll need to pass something like nrow or length depending on your initial object.
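
As a rough illustration of that last point, here is a minimal base R sketch. The data frame `hosts` and its column names are hypothetical stand-ins, not the original poster's object:

```r
# Hypothetical data frame standing in for the real one
hosts <- data.frame(
  host_name = c("aaa", "bbb", "ccc"),
  stringsAsFactors = FALSE
)

# seq_len(n) yields 1, 2, ..., n; here n is the number of rows
hosts$friendly <- paste0("host", seq_len(nrow(hosts)))

# For a plain vector, seq_along() plays the same role as seq_len(length(v))
v <- c("aaa", "bbb", "ccc")
friendly_v <- paste0("host", seq_along(v))
```

Note that this numbers rows, not distinct values, so repeated hostnames would each get a different friendly name.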


#5

Or maybe something like this:

I start with a data frame named df containing one column, names:

df
#>         names
#> 1  wyezsnmpct
#> 2  loifrapnuq
#> 3  mcotjfeglb
#> 4  zdaelstqor
#> 5  soxtzagqkr
#> 6  rjocznhtqu
#> 7  zspjlkfwat
#> 8  zmqtpdyxcw
#> 9  ldryxkighq
#> 10 eylhsudnom

Then using the dplyr package I calculate a new column based on the row number:

library(dplyr)

df %>%
  mutate(nice_name = paste0("host_", row_number()))
#>         names nice_name
#> 1  wyezsnmpct    host_1
#> 2  loifrapnuq    host_2
#> 3  mcotjfeglb    host_3
#> 4  zdaelstqor    host_4
#> 5  soxtzagqkr    host_5
#> 6  rjocznhtqu    host_6
#> 7  zspjlkfwat    host_7
#> 8  zmqtpdyxcw    host_8
#> 9  ldryxkighq    host_9
#> 10 eylhsudnom   host_10

Created on 2019-01-10 by the reprex package (v0.2.1)


#6

It's stored in a data frame as a column.


#7

Something like:

library(tidyverse)
df <- tibble(host_name = c(
             "95b4ae6d890e4c46986d91d7ac4bf08200000W",
             "95b4ae6d890e4c46986d91d7ac4bf08200000W",
             "95b4ae6d890e4c46986d91d7ac4bf08200000V",
             "95b4ae6d890e4c46986d91d7ac4bf08200000V",
             "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
             "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
             "95b4ae6d890e4c46986d91d7ac4bf082000011",
             "95b4ae6d890e4c46986d91d7ac4bf082000011",
             "95b4ae6d890e4c46986d91d7ac4bf082000011",
             "95b4ae6d890e4c46986d91d7ac4bf082000011",
             "95b4ae6d890e4c46986d91d7ac4bf08200000H",
             "95b4ae6d890e4c46986d91d7ac4bf08200000H"))

df <- cbind(df, name = paste0("host", seq_len(nrow(df))))

Gives you this:

                                host_name   name
1  95b4ae6d890e4c46986d91d7ac4bf08200000W  host1
2  95b4ae6d890e4c46986d91d7ac4bf08200000W  host2
3  95b4ae6d890e4c46986d91d7ac4bf08200000V  host3
4  95b4ae6d890e4c46986d91d7ac4bf08200000V  host4
5  95b4ae6d890e4c46986d91d7ac4bf08200000Z  host5
6  95b4ae6d890e4c46986d91d7ac4bf08200000Z  host6
7  95b4ae6d890e4c46986d91d7ac4bf082000011  host7
8  95b4ae6d890e4c46986d91d7ac4bf082000011  host8
9  95b4ae6d890e4c46986d91d7ac4bf082000011  host9
10 95b4ae6d890e4c46986d91d7ac4bf082000011 host10
11 95b4ae6d890e4c46986d91d7ac4bf08200000H host11
12 95b4ae6d890e4c46986d91d7ac4bf08200000H host12

#8

Yes! I wanted this, but couldn't remember the function for getting the index / row number. Apparently, it is row_number(). Who would have thought.


#9

The solutions posted here do not account for the fact that some of your hosts are the same.
When I need to enumerate items, I use this trick:

x <- c(
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H"
)

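paste0("host", as.integer(factor(x)))

which gives you

 [1] "host3" "host3" "host2" "host2" "host4" "host4" "host5" "host5" "host5" "host5" "host1" "host1"

edit: xtfrm(x) is tempting here, but for a character vector it returns tie-aware ranks with gaps (5, 5, 3, 3, ...) rather than consecutive numbers, so the slightly hacky as.integer(factor(x)) stays.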
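
If the numbering should follow the order in which hosts first appear rather than their sort order, match() against unique() is another base R option. A small sketch, using a shortened hypothetical vector instead of the real hostnames:

```r
x <- c("w", "w", "v", "v", "z", "h")  # hypothetical short IDs

# unique(x) keeps values in order of first appearance;
# match() returns each element's position in that vector
by_appearance <- paste0("host", match(x, unique(x)))
# "host1" "host1" "host2" "host2" "host3" "host4"

# factor() sorts its levels, so the numbering follows sort order instead
by_sort_order <- paste0("host", as.integer(factor(x)))
# "host3" "host3" "host2" "host2" "host4" "host1"
```

Either way, identical hostnames end up with identical friendly names.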


#10

The only issue here is that the same hostname may appear more than once.


#11

How? It depends on row numbers, which are sequential and unique (think index)

Never mind me, I'm an idiot. I see it now.


#12

Ohhh... well, @hoelk is spot on with his solution. We could also do this the tidyverse way, using the power of group_by():


library(tidyverse)
df <- tibble(host_name = c(
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H"))

df %>%
  group_by(host_name) %>%
  summarize() %>%
  mutate(nice_name = paste0("host_", row_number()))
#> # A tibble: 5 x 2
#>   host_name                              nice_name
#>   <chr>                                  <chr>    
#> 1 95b4ae6d890e4c46986d91d7ac4bf08200000H host_1   
#> 2 95b4ae6d890e4c46986d91d7ac4bf08200000V host_2   
#> 3 95b4ae6d890e4c46986d91d7ac4bf08200000W host_3   
#> 4 95b4ae6d890e4c46986d91d7ac4bf08200000Z host_4   
#> 5 95b4ae6d890e4c46986d91d7ac4bf082000011 host_5

Created on 2019-01-10 by the reprex package (v0.2.1)


#13

Yes. Or, instead of group_by(), do df %>% select(host_name) %>% distinct() to get a dimension ("lookup") table of distinct names (that's what I thought this table column was!), and engineer friendly names there.


#15

Thanks for this! I don't need them to be grouped by host_name. If I remove group_by, some hostnames get more than one name.


#16

Well, you kind of do. Whether it is group_by() or distinct(), you need to make a list of distinct host names, and you'd handle it separately in a different table. Think of a dimension table in a relational database...

My 2 cents, FWIW. I may be wrong.
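
To make the lookup-table idea a bit more concrete, here is one possible base R sketch using a named character vector as the "dimension table". The short IDs and the names `ids`, `lookup` and `friendly` are made up for illustration:

```r
x <- c("w", "w", "v", "z", "v")  # hypothetical short IDs

# One entry per distinct host: names are the raw IDs, values the friendly names
ids <- unique(x)
lookup <- setNames(paste0("host_", seq_along(ids)), ids)

# Indexing by name translates every element; unname() drops the ID labels
friendly <- unname(lookup[x])
# "host_1" "host_1" "host_2" "host_3" "host_2"
```

The lookup vector can be kept around and reused, so the same host gets the same friendly name wherever it is looked up.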


#17

I'm just using group_by for the side effect that it makes things unique. Taras recommended distinct (great choice) or even unique which is another option.



library(tidyverse)
df <- tibble(host_name = c(
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H"))

df %>%
  unique() %>%
  mutate(nice_name = paste0("host_", row_number()))
#> # A tibble: 5 x 2
#>   host_name                              nice_name
#>   <chr>                                  <chr>    
#> 1 95b4ae6d890e4c46986d91d7ac4bf08200000W host_1   
#> 2 95b4ae6d890e4c46986d91d7ac4bf08200000V host_2   
#> 3 95b4ae6d890e4c46986d91d7ac4bf08200000Z host_3   
#> 4 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#> 5 95b4ae6d890e4c46986d91d7ac4bf08200000H host_5

Created on 2019-01-10 by the reprex package (v0.2.1)


#18

Fake news, I recommended distinct()! :smiley: (I guess they give same results though, so pick your poison)
There are many paths to one... solution :wink:


#19

did not.. YOU'RE fake news!

Ok, so I changed it while you were responding :slight_smile:


#20

Thanks again! This doesn't give me what I am after. I need to keep the same number of host names. The above example still summarizes the host names. I want to see the host name appear more than once. Thanks


#22

oh... well just join it back to your original data:

library(tidyverse)
df <- tibble(host_name = c(
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H"))

df %>%
  unique() %>%
  mutate(nice_name = paste0("host_", row_number())) %>%
  left_join(df)
#> Joining, by = "host_name"
#> # A tibble: 12 x 2
#>    host_name                              nice_name
#>    <chr>                                  <chr>    
#>  1 95b4ae6d890e4c46986d91d7ac4bf08200000W host_1   
#>  2 95b4ae6d890e4c46986d91d7ac4bf08200000W host_1   
#>  3 95b4ae6d890e4c46986d91d7ac4bf08200000V host_2   
#>  4 95b4ae6d890e4c46986d91d7ac4bf08200000V host_2   
#>  5 95b4ae6d890e4c46986d91d7ac4bf08200000Z host_3   
#>  6 95b4ae6d890e4c46986d91d7ac4bf08200000Z host_3   
#>  7 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#>  8 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#>  9 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#> 10 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#> 11 95b4ae6d890e4c46986d91d7ac4bf08200000H host_5   
#> 12 95b4ae6d890e4c46986d91d7ac4bf08200000H host_5

Created on 2019-01-10 by the reprex package (v0.2.1)
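
For what it's worth, the same result can probably be had without the join by numbering hosts inside mutate(). This is only a sketch, assuming dplyr is installed and using a shortened hypothetical column rather than the real hostnames:

```r
library(dplyr)

df <- tibble(host_name = c("w", "w", "v", "z", "v"))  # hypothetical short IDs

# match() against unique() gives each row the index of its host's
# first appearance, so duplicates share a name and no rows are dropped
named <- df %>%
  mutate(nice_name = paste0("host_", match(host_name, unique(host_name))))
```

Here `named` keeps all five rows in their original order, with both "w" rows labelled host_1 and both "v" rows labelled host_2.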