Hi,

I am currently reading the book *Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques*.

In one of the book's use cases, they recommend using Benford's Law to detect anomalous digit distributions that might suggest people were making up the numbers. This could be applied to things like tax returns or "lists of socioeconomic data submitted in support of public planning decision".

There seems to be an R package that lets you run a hypothesis test to determine whether the distribution of leading digits you are looking at is consistent with the Benford distribution.
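For concreteness, here is a minimal sketch of such a test in Python (the names `benford_chi2_test` and `leading_digit` are my own, and this is a hand-rolled chi-square goodness-of-fit test rather than any particular package's method): observed first-digit counts are compared against the Benford proportions P(d) = log10(1 + 1/d).

```python
import math
from collections import Counter

# First-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d).
BENFORD_PROBS = [math.log10(1 + 1 / d) for d in range(1, 10)]

def leading_digit(value):
    """Return the first significant (non-zero) digit of a number."""
    value = abs(value)
    while value >= 10:
        value /= 10
    while value < 1:
        value *= 10
    return int(value)

def benford_chi2_test(values):
    """Chi-square goodness-of-fit test of leading digits against Benford's Law.

    Returns (chi2_statistic, p_value). A small p-value suggests the observed
    first-digit distribution deviates from the Benford expectation.
    """
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)

    chi2 = sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in enumerate(BENFORD_PROBS, start=1)
    )

    # Survival function of the chi-square distribution with 8 degrees of
    # freedom (9 digit bins - 1): for even df = 2m,
    # sf(x) = exp(-x/2) * sum_{i<m} (x/2)^i / i!
    half = chi2 / 2
    p_value = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(4))
    return chi2, min(p_value, 1.0)
```

On data that genuinely follows Benford's Law (long geometric sequences, for example) the p-value comes out large, while uniformly distributed first digits produce a tiny one.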

Take tax returns per company as an example. Let's say, for argument's sake, I have 1,000 companies and 1,000 tax returns, each containing many income line items for one of these companies. Let's say I also have lots of other categorical and numerical data associated with these companies. I want to predict fraudulent returns using a model, and I am interested in using the Benford test to highlight anomalous distributions in each company's returns. If I get a significant p-value, I simply add it as a binary column (1/0) in my training set called `anomalous_distribution`.
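As a sketch, that feature-construction step might look like this (the company names and p-values are made up purely for illustration, and `p_values` stands in for the output of whatever per-company test is run):

```python
# Sketch: turning per-company Benford-test p-values into a binary feature.
# `p_values` is a hypothetical mapping {company_id: p-value}; the numbers
# below are invented for illustration only.
ALPHA = 0.05

p_values = {"acme": 0.003, "globex": 0.41, "initech": 0.07}

# 1 if the company's digit distribution looked anomalous at the cutoff, else 0.
anomalous_distribution = {company: int(p < ALPHA) for company, p in p_values.items()}
# -> {"acme": 1, "globex": 0, "initech": 0}
```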

My question is about multiple comparisons. I am effectively running 1,000 of these tests, so at a 0.05 p-value cutoff I should expect many false positives even if every company is legitimate. Is there a way I can implement these tests that avoids this problem, or at least mitigates it somewhat?
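To put a number on the concern, a quick sketch (assuming, hypothetically, that no company is fraudulent, so every p-value is uniform on [0, 1] under the null):

```python
import random

random.seed(0)  # reproducible illustration

n_tests, alpha = 1000, 0.05

# Under the null (no fraud anywhere), p-values are uniform on [0, 1], so the
# expected number of false positives is alpha * n_tests.
expected_false_positives = alpha * n_tests  # 50.0

# Quick simulation: draw 1000 null p-values and count how many fall below alpha.
null_p_values = [random.random() for _ in range(n_tests)]
false_positives = sum(p < alpha for p in null_p_values)
```

So roughly 50 of the 1,000 companies would get flagged purely by chance, which is exactly what worries me.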

I think this probably applies to other domains besides fraud as well, namely any setting with hypothesis tests across lots of groups, but I just need a pointer in the right direction.

Thanks very much for your time, and I hope you have a lovely weekend.