Monday, December 6, 2010

NLP-based Resume Parser

I was not able to find a suitable resume parser that could handle resumes in any format and summarize them from an NLP perspective, so I decided to create a project that does exactly that: parse resumes in any format and then summarize them. This could help job agencies as well as online job boards automate everything from resume upload by candidates to job matching. One important criterion for this project is to use only freeware tools and libraries, so I spent a good amount of time evaluating a number of them.
- Parse MS Word resumes: I found Apache POI to be the best choice for parsing Word files and generating a text representation of Word resumes (see the sketch below).
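As a rough illustration, here is a minimal sketch of the kind of extraction step I have in mind, using POI's WordExtractor. The file name is just a placeholder and error handling is omitted.

import java.io.FileInputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ResumeText {
    public static void main(String[] args) throws Exception {
        // "resume.doc" is a placeholder path to a candidate's MS Word resume
        FileInputStream fis = new FileInputStream("resume.doc");
        WordExtractor extractor = new WordExtractor(fis);
        // getText() returns the plain-text representation of the document
        String text = extractor.getText();
        fis.close();
        System.out.println(text);
    }
}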

At the beginning, I was quite undecided between Java and Perl. I am quite comfortable with Java and have already worked with some IR-related libraries like Lucene, Nutch and the JAMA matrix API, so my first choice was Java. At the same time, hosting Java-based websites is relatively expensive, and I was looking for a cheap hosting solution; that prompted me to consider Perl and Linux. I am not that comfortable with Perl, but I thought I would give it a try. CPAN contains a huge number of Perl modules, and I started using some of them, like Lingua::EN::Sentences and Stem. There are also some good modules, like Text::Extract::Word and Ole, for extracting text from Word files. I then evaluated Perl's support for NLP parsers. Somehow I felt it has very limited support, essentially only the Lingua packages, whereas Java has many libraries like OpenNLP, Stanford NLP, GATE, UIMA, etc., so I decided to use Java open-source libraries to carry my R&D work on the NLP-based resume parser further.

- A good Java-based NLP library: At first I started using the Stanford NLP libraries to extract sentences, do POS tagging, etc. They are good libraries for research work, but you need to pay for commercial usage, so I decided to look for an alternative. I then came across OpenNLP, UIMA and GATE's ANNIE, and at present I am evaluating these freeware libraries. I am basically looking for a parser based on an HMM (Hidden Markov Model), since I am quite comfortable with HMMs. At first I thought of implementing my own HMM in Java for POS tagging or other parsers, but I decided to first try the open-source libraries in case any of them is HMM based. The tagger in OpenNLP uses Maximum Entropy and the one in GATE uses an automata-based algorithm, so I have decided to use the UIMA sandbox, which provides an HMM-based tagger (a small sketch of the underlying HMM decoding idea follows below).
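To make the HMM idea concrete, here is a minimal, self-contained sketch of Viterbi decoding for POS tagging. The tiny tag set, sentence and probabilities are made up purely for illustration; this is not the UIMA annotator itself, just the decoding algorithm such a tagger is built around.

import java.util.*;

public class TinyViterbi {
    public static void main(String[] args) {
        String[] tags = {"NOUN", "VERB"};
        String[] sentence = {"resumes", "match", "jobs"};

        // Toy probabilities, invented for illustration only.
        double[] start = {0.6, 0.4};                       // P(tag at position 0)
        double[][] trans = {{0.3, 0.7}, {0.8, 0.2}};       // P(next tag | previous tag)
        Map<String, double[]> emit = new HashMap<>();      // P(word | tag)
        emit.put("resumes", new double[]{0.5, 0.1});
        emit.put("match",   new double[]{0.2, 0.6});
        emit.put("jobs",    new double[]{0.5, 0.1});

        int n = sentence.length, k = tags.length;
        double[][] v = new double[n][k];   // best path score ending in tag s at position t
        int[][] back = new int[n][k];      // back-pointers to recover the best sequence

        for (int s = 0; s < k; s++)
            v[0][s] = start[s] * emit.get(sentence[0])[s];

        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                double best = -1; int arg = 0;
                for (int p = 0; p < k; p++) {
                    double score = v[t - 1][p] * trans[p][s];
                    if (score > best) { best = score; arg = p; }
                }
                v[t][s] = best * emit.get(sentence[t])[s];
                back[t][s] = arg;
            }
        }

        // Follow the back-pointers from the best final state.
        int[] path = new int[n];
        path[n - 1] = v[n - 1][0] >= v[n - 1][1] ? 0 : 1;
        for (int t = n - 1; t > 0; t--) path[t - 1] = back[t][path[t]];

        for (int t = 0; t < n; t++)
            System.out.println(sentence[t] + " -> " + tags[path[t]]);
    }
}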

So finally I have decided to use Apache POI together with UIMA's HMM-based annotator. However, I am still at an evolving stage, and if I find a more suitable parser I will switch to it. Currently I am considering only MS Word resumes; once my R&D work is over I will consider the PDF format as well. A rough sketch of how the extracted text would be fed to a UIMA analysis engine follows.
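For illustration, here is a minimal sketch of pushing extracted resume text through a UIMA analysis engine. The descriptor name "HmmTaggerAggregate.xml" is a hypothetical placeholder for whatever aggregate descriptor wraps a tokenizer, sentence splitter and the sandbox HMM tagger; the rest uses standard UIMA framework calls.

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.jcas.JCas;
import org.apache.uima.util.XMLInputSource;

public class TagResume {
    public static void main(String[] args) throws Exception {
        // "HmmTaggerAggregate.xml" is a placeholder for the aggregate descriptor.
        XMLInputSource in = new XMLInputSource("HmmTaggerAggregate.xml");
        AnalysisEngineDescription desc =
                UIMAFramework.getXMLParser().parseAnalysisEngineDescription(in);
        AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(desc);

        JCas jcas = ae.newJCas();
        jcas.setDocumentText("John has six years of Java and Lucene experience.");
        ae.process(jcas);   // tokens, sentences and POS tags are added to the CAS

        // Downstream code would iterate over the annotation indexes here.
        ae.destroy();
    }
}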

In my next post, I will elaborate in detail on my strategy for a complete end-to-end candidate-tracking solution built on NLP-based resume parsing and resume matching, and on how this solution can be integrated with third-party online job sites, so that HR people can focus on phrasing a summary of the candidate they are looking for while the tool filters all incoming resumes and forwards only the relevant ones to them.

Friday, August 3, 2007

Term Frequency in the Term-Document Matrix

A term-document matrix is used in information retrieval systems: each row represents a term and each column represents a document. Each cell in the t x d matrix represents the importance of term t in document d. Sometimes the absolute term frequency is used for each cell, but absolute frequencies can be misleading, primarily because documents and document collections vary in size. It is reasonable to expect a given term to occur more frequently in a long document of a collection than in a short one. At the same time, it is not reasonable to expect the frequency of a given term to be roughly the same in all documents of a given length, since it may be a common term for documents in one collection and a rare term in documents of another. Thus relative term frequencies, i.e. term frequency counts that have been adjusted to take account of document and collection size and characteristics, are used in retrieval computations.
The objective of calculating term frequencies is to estimate the value of a term to the information retrieval system, hence it is important to view frequencies in the context of a specific document collection. One fundamental way of doing this is to use the inverse document frequency weight.

The Inverse Document Frequency (IDF) of term k is defined as

IDFk = log2(N/dk) + 1 = log2 N - log2 dk + 1

where
N: the number of documents in the collection
dk : the number of documents containing the term k
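As a quick illustration with made-up numbers: if the collection contains N = 64 documents and the term "java" appears in dk = 8 of them, its IDF is

log2(64/8) + 1 = log2 8 + 1 = 3 + 1 = 4

whereas a term that appears in every document gets log2(64/64) + 1 = 1, the smallest possible weight, since it does nothing to distinguish one document from another.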

The tf.idf weight of term k in document i is then obtained by multiplying the term frequency by the inverse document frequency:
wik = fik(log2N - log2dk + 1)

wik: the weight of term k in document i
fik: the absolute frequency of term k in document i

Thus the weight of a term in a document is its frequency in that document multiplied by a factor that decreases logarithmically as the fraction of documents in the collection containing the term grows.
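As a small sketch of how these formulas translate into code, the following computes wik = fik(log2 N - log2 dk + 1) for every term and document; the three tokenized toy documents are made up purely for illustration.

import java.util.*;

public class TfIdf {
    public static void main(String[] args) {
        // Toy collection: each document is already tokenized and lower-cased.
        List<String[]> docs = Arrays.asList(
                new String[]{"java", "developer", "resume", "java"},
                new String[]{"perl", "developer", "resume"},
                new String[]{"java", "lucene", "search"});

        int n = docs.size();   // N: number of documents in the collection

        // dk: number of documents containing term k
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs)
            for (String term : new HashSet<>(Arrays.asList(doc)))
                df.merge(term, 1, Integer::sum);

        // wik = fik * (log2 N - log2 dk + 1)
        for (int i = 0; i < n; i++) {
            Map<String, Integer> tf = new HashMap<>();          // fik per term
            for (String term : docs.get(i)) tf.merge(term, 1, Integer::sum);

            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = log2(n) - log2(df.get(e.getKey())) + 1;
                double w = e.getValue() * idf;
                System.out.printf("doc %d  %-10s w = %.3f%n", i, e.getKey(), w);
            }
        }
    }

    static double log2(double x) { return Math.log(x) / Math.log(2); }
}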