How to compute frequencies from two different data sets

Hantesfa1 · February 16, 2023, 4:24pm

I am trying to run a frequency analysis using two different data sets. The only variable they have in common is site ID, and the data sets have different lengths. I tried using table() with a variable from each, but the numbers I get are incorrect. What is the first thing I need to do when running frequencies on variables from two different data sets? I would appreciate any help with this.

nirgrahamuk · February 16, 2023, 4:58pm

Here is a simulation of what you say you have, with simple counts by site id.
I'm not seeing an obvious problem to solve, so please feel free to be more precise in your question.


# start to simulate
set.seed(42)
ssize_a <- 100

(df_a <- data.frame(site_id=sample(1:3,size = ssize_a,replace=TRUE),
                    lowers=sample(letters[1:3],size=ssize_a,replace=TRUE)))

ssize_b <- 1000

(df_b <- data.frame(site_id=sample(1:3,size = ssize_b,replace=TRUE),
                    uppers=sample(LETTERS[1:3],size=ssize_b,replace=TRUE)))

# end simulation

# start frequency analysis
# first one:
table("site" = df_a$site_id, df_a$lowers)
# second one:
table("site" = df_b$site_id, df_b$uppers)

Hantesfa1 · February 16, 2023, 5:17pm

The two data sets I am using to run the frequency analysis are very big, and both have a number of variables. Below is an example of a frequency analysis I did using one variable from each data set, but the result gave me incorrect numbers.

table(gold_main$lang_english, gold_der$screening_base, exclude = NULL)

         0      1
  0   2915 235846
  1   5784 250267

nirgrahamuk · February 16, 2023, 5:38pm

Oh, I see. What you are asking would seem to make little sense, because there is no relationship by which to say which entries from the two data sets go together...
I thought you said you had site id?

Extending my previous code:


# join on the shared key; site_id repeats on both sides, so keep all matches
df_joined <- dplyr::full_join(df_a,
                              df_b,
                              by = "site_id",
                              multiple = "all")

table(lowers=df_joined$lowers,
      uppers=df_joined$uppers)
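
One thing to keep in mind with this join: site_id is repeated many times in both data frames, so every row for a site in one data frame is paired with every row for that site in the other. A minimal sketch with tiny made-up data (the names small_a and small_b are hypothetical, and base R merge() stands in for dplyr::full_join so that it runs without extra packages):

```r
# Tiny made-up data: site 1 appears twice in small_a and three times in small_b
small_a <- data.frame(site_id = c(1, 1, 2),
                      lowers  = c("a", "b", "a"))
small_b <- data.frame(site_id = c(1, 1, 1, 2),
                      uppers  = c("A", "A", "B", "C"))

# Full (outer) join on the shared key, like full_join(..., by = "site_id")
joined <- merge(small_a, small_b, by = "site_id", all = TRUE)

# Rows per site in the join = rows in small_a * rows in small_b for that site
rows_a <- as.vector(table(small_a$site_id))
rows_b <- as.vector(table(small_b$site_id))
expected_rows <- sum(rows_a * rows_b)

nrow(joined)     # number of row pairs produced by the join
expected_rows    # the same number, computed from the per-site counts

# Cross-tabulation of the joined pairs
table(lowers = joined$lowers, uppers = joined$uppers)
```

So the cells of this cross-tab count pairs of rows that share a site, not individual records from either data frame.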

Hantesfa1 · February 16, 2023, 5:53pm

Sorry for the confusion. The only common variable is the site ID. However, the number of variables and the size of the data differ between the two data sets. So I am essentially taking one variable from one data set and another variable from the other and cross-tabulating them.

nirgrahamuk · February 16, 2023, 5:56pm

Yes; so either you get the independent counts, as I showed in my first post, or, to get related counts based on matching on site id, you would have to do a proper join, as in the example in my second post.

Good luck.

Hantesfa1 · February 16, 2023, 6:03pm

Thank you so much!
Quick question: what do ssize_a <- 100 and ssize_b <- 1000 mean?

nirgrahamuk · February 16, 2023, 6:08pm

They were there to simulate data: they set how many records to simulate. You can see that they are used repeatedly in the sample() calls.
It's irrelevant to you, aside from letting you reproduce the df_a and df_b that I used for the examples.
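
To make that concrete, a small sketch (the seed and the size here are arbitrary choices for illustration): the size argument of sample() sets how many values are drawn, and therefore how many rows the simulated data frame gets.

```r
set.seed(1)            # arbitrary seed, only so the draw is repeatable
n_rows <- 5            # plays the same role as ssize_a / ssize_b

# Draw n_rows site ids from 1, 2, 3, with repeats allowed
draws <- sample(1:3, size = n_rows, replace = TRUE)
length(draws)          # one value per requested record

demo <- data.frame(site_id = draws)
nrow(demo)             # same as n_rows
```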

Hantesfa1 · February 16, 2023, 6:28pm

Would you please use lang_english from the gold_main data and screening_base from the gold_der data in your example? I am still confused.
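
A sketch of how the join approach from earlier in the thread might look with these names. The real gold_main and gold_der are not shown here, so the data below is simulated, and the key column being called site_id in both data frames is an assumption. Base R merge() is used in place of dplyr::full_join so the sketch runs without extra packages.

```r
set.seed(42)

# Simulated stand-ins: the real data is not available here.
# Assumption: both data frames carry the key in a column named site_id.
gold_main <- data.frame(site_id      = sample(1:3, size = 20, replace = TRUE),
                        lang_english = sample(0:1, size = 20, replace = TRUE))
gold_der  <- data.frame(site_id        = sample(1:3, size = 50, replace = TRUE),
                        screening_base = sample(0:1, size = 50, replace = TRUE))

# Keep only the key and the one variable needed from each side
main_small <- gold_main[, c("site_id", "lang_english")]
der_small  <- gold_der[,  c("site_id", "screening_base")]

# Full join on site_id (base R counterpart of dplyr::full_join(..., by = "site_id"))
gold_joined <- merge(main_small, der_small, by = "site_id", all = TRUE)

# Cross-tabulate on the joined rows; useNA = "ifany" shows unmatched rows as NA
xtab <- table(lang_english   = gold_joined$lang_english,
              screening_base = gold_joined$screening_base,
              useNA = "ifany")
xtab
```

Because site_id repeats within each data frame, the cells count pairs of rows that share a site. With data sets as large as the real ones, the joined result can become very big, so it may be worth trying this on a subset first.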
