comparing columns

Hi everyone.

I have a large dataset which looks like TABLE 1.
Source refers to data sources. There are two data sources. The letters are individuals.
For each data source, individuals can report different codes. There are n no. of codes.

And I want to produce TABLE 2 where:

Code refers to the all the codes reported on Code_1 and Code_2 columns.

Count_in_Code_1 refers to the number of persons reporting each code under Code_1 column. For example Code 100 was reported by A and E. Hence, there are 2 counts. But only A appeared on both sources who reported Code 100. Hence, under the Match_Sources column, there is only 1 count for Code 100.

Similarly, Code 102 was reported by C,E, F and G under Code_1 column (hence the 4 in Table 2). But only C, F and G appeared on both sources who reported Code 102.

Really appreciate your help on this.



Source_1 Code_1 Source_2 Code_2
A 100;101 A 100;101
B B 100;102
C 101;102;103 C 101
D 103;104 D 102;103;104
E 100;102 E
F 101;102 F 101;102;103;104
G 102;103;104 G 102;103;105


Code Count_in_Code_1 Match_Sources
100 2 1
101 3 3
102 4 3
103 3 2
104 2 1
105 0 0

I would be inclined to reformat the input data into a better form to analyse. So that each row in the new table was one result. So the first 2 rows would transform to

Col Source Code
1 A 100
1 A 101
2 A 100
2 A 101
2 B 100
2 B 102

If you read in a line at a time (either with a loop or appropriated function), split out the different codes and append to a table using rbind

It should then be easier to use the standard group by and summarise to get your result

Hope that helps

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.