Counting column values before Fisher's test

Hello,

I try to to join two files and then perform a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in casefile or controlfile.

casefile:

GENE  CASE_COUNT_HET  CASE_COUNT_CH   CASE_COUNT_HOM  CASE_TOTAL_AC

ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7

controlfile:

    GENE  CASE_COUNT_HET  CASE_COUNT_CH   CASE_COUNT_HOM  CASE_TOTAL_AC

ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
ENSG00000174776 1 1 0 2
ENSG00000076864 1 0 1 13
ENSG00000086015 1 0 1 25

I have this script:

#!/usr/bin/env Rscript
library("argparse")
suppressPackageStartupMessages(library("argparse"))

parser <- ArgumentParser()
parser$add_argument("--casefile", action="store")
parser$add_argument("--casesize", action="store", type="integer")
parser$add_argument("--controlfile", action="store")
parser$add_argument("--controlsize", action="store", type="integer")
parser$add_argument("--outfile", action="store")

args <- parser$parse_args()

case.dat<-read.delim(args$casefile, header=T, stringsAsFactors=F, sep="\t")
names(case.dat)[1]<-"GENE"
control.dat<-read.delim(args$controlfile, header=T, stringsAsFactors=F, sep="\t")
names(control.dat)[1]<-"GENE"

dat<-merge(case.dat, control.dat, by="GENE", all.x=T, all.y=T)
dat[is.na(dat)]<-0

dat$P_DOM<-0
dat$P_REC<-0

for(i in 1:nrow(dat)){
  
  #Dominant model
  case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
  control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
  
  if(case_count>args$casesize){
    case_count<-args$casesize
  }else if(case_count<0){
    case_count<-0
   }
  if(control_count>args$controlsize){
    control_count<-args$controlsize
  }else if(control_count<0){
    control_count<-0
   }
  
  mat<-cbind(c(case_count, (args$casesize-case_count)), c(control_count, (args$controlsize-control_count)))
  dat[i,]$P_DOM<-fisher.test(mat, alternative="greater")$p.value

and problem starts in here:

  case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
  control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM

the result of case_count and control_count is NULL values, however corresponding columns in both input files are NOT empty. I am confused, which step to take next.

as a newbie in R, SO and statistics, I will be happy about any suggestion. Thank you!

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.