how to combine three different data from a dataset?

Hi,

I am just a newbie learning RStudio... and I have a small question.

I have a database about tourism in which there is a string with regions and I would like to compare between two or three regions within that string. I mean, in this database, you can find country of origin of the tourist, from which region he entered the country and how much he spent.

I know how to “see” country of origin and region of entrance ''table(pais,ccaa) pais meaning origin and ccaa region of entrance. I would like to include expenses, so that I know how much tourists spent in each region and which were their countries of origin and compare (I am in Spain, so I want to know if Russians entering in Madrid spent more than Russians entering in Barcelona… or if Britons entering in Madrid spent less than French entering in Barcelona).

The code might be something like “show me only data from Madrid and Barcelona + country of origin + expenses”

Am I clear?

Welcome. It would be good you could provide a reproducible example of the dataset and the code that you have tried. It is easier than trying to describe it in words.

Also can you change the title so that it describes the problem rather than telling us that you're a newbie.

Sorry for this, it was my first message. :slight_smile:

I know how to combine two data
table(pais,ccaa).
this would show country of origin and region of entrance.

I would like to include a third variable: expenses.
My idea is to analyze those three variables together and compare so that I can see if (for example) Germans entering in Madrid spend more than Germans entering in Barcelona.

Sorry if I don't explain myself clearly, but English is not my mother tongue and I am not mathematician or statistic expert, but journalist :slight_smile:

(I am not sure of how to upload a dataset here, it only allows to upload jpeg, pdf... files)

Thank you so much!

There are many ways of joining the datasets but it depends on what is in the datasets and how you want to join them.

Your English is fine, but if you want, the instructions are also in Spanish.

i guess it's not working :confused:

dput(head(elevado_eg_mod_web_tur_0921, 25)[c('pais', 'ccaa', 'gastototal')])
#> Error in head(elevado_eg_mod_web_tur_0921, 25): objeto 'elevado_eg_mod_web_tur_0921' no encontrado

Created on 2021-11-29 by the reprex package (v2.0.1)

although I think I used datapasta correctly:

dput(head(elevado_eg_mod_web_tur_0921, 25)[c('pais', 'ccaa', 'gastototal')])
structure(list(pais = c(10L, 15L, 15L, 10L, 15L, 15L, 15L, 15L,
10L, 15L, 15L, 15L, 15L, 15L, 10L, 10L, 10L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 15L), ccaa = c(16L, 13L, 13L, 13L, 13L, 7L, 9L, 12L,
17L, 9L, 13L, 9L, 9L, 7L, 9L, 9L, 9L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 9L), gastototal = c(2104.8327842, 3339.4242179, 2527.0036036,
1076.4172965, 2812.0859804, 6686.339234, 2196.4232084, 6941.6912321,
2085.3796211, 3355.3558589, 3422.6737527, 2697.8463611, 1750.2838905,
4599.6563689, 1444.3916741, 1403.0404296, 1345.220257, 1303.2814619,
704.5863107, 1345.4311241, 1495.1874334, 1074.8804203, 1345.4311241,
1252.3900093, 997.72492846)), row.names = c(NA, 25L), class = "data.frame")

:frowning:

Hello @Miguelito ,

The example data you provided is useful.
The table function that you used before is a summary function : summing 1 for each observation.
To sum spending you can use the functionality of the dplyr package.
See the example below where:

  • df2 contains the summary in 'long form' and contains the information you want
  • df3 contains the same information in 'wide form' that resembles the table output.
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

df1 <-
  structure(list(pais = c(10L, 15L, 15L, 10L, 15L, 15L, 15L, 15L,
10L, 15L, 15L, 15L, 15L, 15L, 10L, 10L, 10L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 15L), ccaa = c(16L, 13L, 13L, 13L, 13L, 7L, 9L, 12L,
17L, 9L, 13L, 9L, 9L, 7L, 9L, 9L, 9L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 9L), gastototal = c(2104.8327842, 3339.4242179, 2527.0036036,
1076.4172965, 2812.0859804, 6686.339234, 2196.4232084, 6941.6912321,
2085.3796211, 3355.3558589, 3422.6737527, 2697.8463611, 1750.2838905,
4599.6563689, 1444.3916741, 1403.0404296, 1345.220257, 1303.2814619,
704.5863107, 1345.4311241, 1495.1874334, 1074.8804203, 1345.4311241,
1252.3900093, 997.72492846)), row.names = c(NA, 25L), class = "data.frame")

df2 <- df1 %>%
  group_by(pais,ccaa) %>%
  summarise(gastototal=sum(gastototal),.groups = "drop") %>%
  arrange(ccaa,pais)

df3 <- df2 %>%
  pivot_wider(id_cols=pais,names_from = ccaa,
              names_prefix="ccaa",values_from = gastototal,
              values_fill=0) %>%
  print()
#> # A tibble: 3 x 8
#>    pais ccaa4  ccaa7  ccaa9 ccaa12 ccaa13 ccaa16 ccaa17
#>   <int> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1     6 8521.     0      0      0      0      0      0 
#> 2    15    0  11286. 10998.  6942. 12101.     0      0 
#> 3    10    0      0   4193.     0   1076.  2105.  2085.
Created on 2021-11-29 by the reprex package (v2.0.0)
1 Like

Thanks a lot!! I'll try to understand it and apply it :slight_smile: I will keep you posted :slight_smile:

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.