I have a dataset containing more than 140,000 entries. Once it has been cleaned and incomplete entries have been removed, I anticipate that ~120,000 entries will remain for analysis.
These entries contain information relating to the location, status and management of some waterpoints. Some of the answers are categorical (e.g., YES, NO or NA) whilst others are quantitative (e.g., how many people use the waterpoint, or its coordinates). Furthermore, I have extracted point values of certain indices, such as poverty, at each waterpoint. In total there are 37 responses/indices for each waterpoint.
a) I would like to identify any factors which may influence the successful adoption of the technique, e.g., whether the technique has higher rates of adoption in poorer parts of the country or at bigger waterpoints. (This analysis would cover only the waterpoints that have heard of the technique, ~50,000 points.)
b) I would then like to use this information to predict which waterpoints would be more likely to take up the technique if they were informed of it.
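To make (a) and (b) concrete, here is a rough sketch of the kind of workflow I have in mind, using only base R's glm() and predict() on simulated data. The data frame and all of its column names (heard, adopted, poverty, users, managed) are hypothetical stand-ins, not my real variables, and logistic regression is just one candidate model rather than something I have settled on.

```r
# Simulated stand-in data: every column name here is hypothetical
set.seed(42)
n <- 2000
waterpoints <- data.frame(
  heard   = rbinom(n, 1, 0.4),                               # 1 = has heard of the technique
  poverty = runif(n),                                        # extracted poverty index
  users   = rpois(n, 200),                                   # number of people using the waterpoint
  managed = factor(sample(c("YES", "NO"), n, replace = TRUE)) # a categorical answer
)

# Adoption is only observed for waterpoints that have heard of the technique
p <- plogis(-1 + 1.5 * waterpoints$poverty + 0.003 * waterpoints$users)
waterpoints$adopted <- ifelse(waterpoints$heard == 1, rbinom(n, 1, p), NA)

informed   <- subset(waterpoints, heard == 1)
uninformed <- subset(waterpoints, heard == 0)

# (a) Which factors are associated with adoption among informed waterpoints?
fit <- glm(adopted ~ poverty + users + managed,
           data = informed, family = binomial)
summary(fit)

# (b) Predicted probability of adoption for waterpoints not yet informed
uninformed$p_adopt <- predict(fit, newdata = uninformed, type = "response")
head(uninformed[order(-uninformed$p_adopt), ])
```

Step (b) in this sketch assumes that waterpoints which have heard of the technique behave like those which have not, and I am not sure that holds for my data.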
I have been exploring some R packages, namely MGLM, VGAM, glm2, randomForest and lavaan.
I am a bit confused about which package/technique would be best for both analysing the data and using it for prediction. Does anyone have experience of analysing a similar dataset, and are there any pointers or good resources I could use to learn about suitable methods?