I have more than 50,000 PDF files on disk, and the collection grows by 10 to 100 files a day. I want to preprocess the files, cluster them, and make searching them faster.
I can get something working on my own, but I would really appreciate your opinions on the best approach.
Here is how I would break the work down:
- Data Extraction
- Clustering the Documents for better search
- Keeping track of file changes on the disk (new, modified, and deleted files)
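For the last task, this is roughly the kind of thing I have in mind. It is only a minimal sketch using the Python standard library, and it assumes a file counts as changed when its modification time or size differs from the previous scan. The function names and the JSON manifest file are my own placeholders, not from any existing tool.

```python
import json
import os
from pathlib import Path


def scan_pdfs(root):
    """Map each PDF path under root to its [mtime_ns, size] signature."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                stat = os.stat(path)
                # Stored as a list so it compares equal after a JSON round trip.
                snapshot[path] = [stat.st_mtime_ns, stat.st_size]
    return snapshot


def diff_snapshots(old, new):
    """Return (added, modified, removed) as sorted lists of paths."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return added, modified, removed


def load_manifest(manifest_path):
    """Load the previous snapshot, or an empty one on the first run."""
    path = Path(manifest_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_manifest(manifest_path, snapshot):
    """Persist the current snapshot for the next run."""
    Path(manifest_path).write_text(json.dumps(snapshot), encoding="utf-8")


def detect_changes(root, manifest_path):
    """Compare the disk with the saved manifest, then update the manifest."""
    old = load_manifest(manifest_path)
    new = scan_pdfs(root)
    changes = diff_snapshots(old, new)
    save_manifest(manifest_path, new)
    return changes
```

I would run `detect_changes` once a day and only re-extract and re-index the files it reports as added or modified. I am not sure whether a full rescan like this is good enough at this scale, or whether a file system watcher would be a better fit.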
Thanks, everyone, for your feedback. Any pointers on any of the tasks above would help a lot.