I was not able to find a suitable resume parser that could handle any resume and summarize it from an NLP perspective. Hence I decided to create a project that parses resumes in any format and then summarizes them. This could help job agencies as well as online job boards automate everything from the candidate's resume upload through to job matching. One important criterion of this project is to use only freeware tools and libraries, so I spent a good amount of time evaluating a number of them.
- Parse MS Word resumes: I found Apache POI to be the best choice for parsing Word files and generating a text representation of the resumes.
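As a rough illustration of the Word-to-text step, the sketch below uses Apache POI's Word text extractors. It assumes the `poi`, `poi-scratchpad` (for `.doc`), and `poi-ooxml` (for `.docx`) jars are on the classpath. The class name `ResumeTextExtractor` and the choice to dispatch on the file extension are made up for this example and are not part of the project described here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Hypothetical helper: turns a Word resume into plain text with Apache POI.
class ResumeTextExtractor {

    // Picks the POI extractor from the file extension (an assumption of this sketch).
    static String extractText(Path resume) throws IOException {
        String name = resume.getFileName().toString().toLowerCase(Locale.ROOT);
        try (InputStream in = Files.newInputStream(resume)) {
            if (name.endsWith(".docx")) {
                // OOXML format (Word 2007 and later)
                try (XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(in))) {
                    return extractor.getText();
                }
            }
            if (name.endsWith(".doc")) {
                // Binary format (Word 97-2003)
                try (WordExtractor extractor = new WordExtractor(new HWPFDocument(in))) {
                    return extractor.getText();
                }
            }
            throw new IllegalArgumentException("Not a Word file: " + resume);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ResumeTextExtractor path/to/resume.doc
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```

The extracted text is what a later NLP stage (sentence splitting, POS tagging) would take as input.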
At the beginning, I could not decide whether to go with Java or Perl. I am quite comfortable with Java and have already worked with IR-related libraries such as Lucene, Nutch, JAMA (a matrix API), and so on, so Java was my first choice. At the same time, hosting Java-based websites is somewhat expensive, so I was looking for a cheap hosting solution. That prompted me to think about Perl and Linux. I am not comfortable with Perl, but I thought I would give it a try. CPAN contains a huge number of Perl modules, and I started using some of them, such as Lingua::EN::Sentences, Stem, etc. There are also some good modules, such as Text::Extract::Word, Ole, etc., for extracting text from Word files. I then started evaluating Perl's support for NLP parsers. I felt it was very limited, and confined to the Lingua packages, whereas Java has many libraries such as OpenNLP, Stanford NLP, GATE, and UIMA. So I decided to use Java open source libraries to carry my R&D work on an NLP-based resume parser further.
- A good Java-based NLP library: At first I started using the Stanford NLP libraries for sentence extraction, POS tagging, and so on. They are good libraries for research work, but commercial use requires a paid license, so I decided to look for an alternative. Then I came across OpenNLP, UIMA, and GATE's ANNIE parsers. At present I am evaluating these freeware libraries. I am basically looking for parsers based on HMMs (Hidden Markov Models), since I am quite comfortable with them. At first I thought of implementing my own HMM in Java for POS tagging or other parsers, but I decided to try the open source libraries first in case any of them is HMM-based. The parser in OpenNLP uses Maximum Entropy, and the one in GATE uses an automata-based algorithm. So I decided to use the annotator from UIMA's sandbox, which uses an HMM-based parser.
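To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding for POS tagging in plain Java. It is not the UIMA sandbox annotator's code. The class name `HmmPosTagger`, the three-tag set, and every probability below are invented for illustration, where a real tagger would estimate them from a tagged corpus.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy HMM POS tagger: hidden states are tags, observations are words.
class HmmPosTagger {
    static final String[] TAGS = {"DET", "NOUN", "VERB"};

    // All numbers are made up for this sketch.
    static final double[] START = {0.6, 0.3, 0.1};   // P(first tag)
    static final double[][] TRANS = {                // P(next tag | previous tag)
        {0.05, 0.90, 0.05},   // from DET
        {0.10, 0.30, 0.60},   // from NOUN
        {0.50, 0.40, 0.10},   // from VERB
    };
    static final Map<String, double[]> EMIT = new HashMap<>();  // P(word | tag)
    static {
        EMIT.put("the",       new double[] {0.6, 0.0, 0.0});
        EMIT.put("a",         new double[] {0.4, 0.0, 0.0});
        EMIT.put("candidate", new double[] {0.0, 0.4, 0.0});
        EMIT.put("team",      new double[] {0.0, 0.3, 0.0});
        EMIT.put("manages",   new double[] {0.0, 0.0, 0.6});
        EMIT.put("projects",  new double[] {0.0, 0.3, 0.4});  // noun or verb
    }
    static final double FLOOR = 1e-6;  // small probability for unseen word/tag pairs

    static double logEmit(String word, int tag) {
        double[] p = EMIT.get(word.toLowerCase(Locale.ROOT));
        return Math.log(p == null ? FLOOR : Math.max(p[tag], FLOOR));
    }

    // Viterbi: returns the most probable tag sequence under the toy model.
    static String[] tag(String[] words) {
        int n = words.length, k = TAGS.length;
        if (n == 0) return new String[0];
        double[][] score = new double[n][k];  // best log-probability ending in tag s at position t
        int[][] back = new int[n][k];         // previous tag on that best path
        for (int s = 0; s < k; s++) {
            score[0][s] = Math.log(START[s]) + logEmit(words[0], s);
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double cand = score[t - 1][p] + Math.log(TRANS[p][s]);
                    if (cand > best) { best = cand; back[t][s] = p; }
                }
                score[t][s] = best + logEmit(words[t], s);
            }
        }
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) last = s;
        }
        String[] tags = new String[n];
        for (int t = n - 1; t >= 0; t--) {  // follow back-pointers from the end
            tags[t] = TAGS[last];
            last = back[t][last];
        }
        return tags;
    }

    public static void main(String[] args) {
        // prints: DET NOUN VERB NOUN
        System.out.println(String.join(" ", tag("the candidate manages projects".split(" "))));
        // prints: DET NOUN VERB
        System.out.println(String.join(" ", tag("the team projects".split(" "))));
    }
}
```

Note how "projects" is tagged as a noun after a verb but as a verb after a noun: the transition probabilities resolve the ambiguity from context, which is the main appeal of an HMM tagger.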
So finally I decided to use Apache POI and UIMA's annotator, which is based on HMM. However, the project is still evolving, and if I find another suitable parser, I will switch to it. Currently I am considering only MS Word resumes; once my R&D work is over, I will consider the PDF format too.
In my next post, I will describe in detail my strategy for a complete end-to-end solution for candidate tracking using NLP-based resume parsing and resume matching, and how this solution can be integrated with third-party online job sites. HR people can then focus on phrasing the summary of the candidate they are looking for, while the tool filters all the resumes and sends only the relevant ones to them.