Tokenization using the Indic NLP Library

Raghvendra Pratap Singh
2 min read · Dec 6, 2020

Hello! I should say नमस्ते, since today’s topic concerns Indian languages.

Natural Language Processing looks fascinating, but just like Machine Learning, it needs data cleaning and data pre-processing.

Sounds boring, right? 😬 But it’s not our mistake… machines never tried to learn human languages 😐. It was us who generously learned numbers to communicate with them 🙅. Jokes apart, when we talk about data pre-processing, tokenization is an integral part of it. Basically, we split the text into smaller units called tokens, which can be words or characters.

Fine! We understand this. English has the NLTK library, but what about Hindi?

Well, thanks to Anoop Kunchukuttan for gifting us the Indic NLP Library. For Jupyter notebooks or Google Colab, you can install it from here. He has provided a fine notebook as a tutorial, where the output of tokenization looks like this:

Tokens:
सुनो
,
कुछ
आवाज़
आ
रही
है
।
फोन
?
Reference: https://nbviewer.jupyter.org/url/anoopkunchukuttan.github.io/indic_nlp_library/doc/indic_nlp_examples.ipynb 🙏
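
If you want to try that yourself, here is a minimal sketch (assuming the library is installed with pip install indic-nlp-library, and using a sentence I reconstructed from the tokens above, just for illustration):

from indicnlp.tokenize import indic_tokenize

# sentence reconstructed from the tokens above, for illustration only
text = 'सुनो, कुछ आवाज़ आ रही है। फोन?'

# trivial_tokenize splits on whitespace and separates punctuation
for token in indic_tokenize.trivial_tokenize(text, lang='hi'):
    print(token)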

But this output comes as a list. When we try to use this tokenized list in further steps like TF-IDF, we may face an error like:

AttributeError: 'list' object has no attribute 'lower'

To understand more, you can look up this error in similar questions on Stack Overflow.
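
To make the cause concrete, here is a minimal sketch of how the error shows up, assuming scikit-learn’s TfidfVectorizer (its default lowercase=True calls .lower() on each document, so it expects strings, not lists of tokens):

from sklearn.feature_extraction.text import TfidfVectorizer
from indicnlp.tokenize import indic_tokenize

# a hypothetical list of raw Hindi sentences
trainList = ['सुनो, कुछ आवाज़ आ रही है। फोन?']

# tokenizing gives a list of token lists...
tokenized = [indic_tokenize.trivial_tokenize(sentence) for sentence in trainList]

# ...and passing those lists to TfidfVectorizer fails,
# because it tries to call .lower() on each document
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(tokenized)  # AttributeError: 'list' object has no attribute 'lower'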

What to do now?

Well, I have the following snippet for you:

from indicnlp.tokenize import indic_tokenize

listFinal = []
for i in trainList:
    value = indic_tokenize.trivial_tokenize(i)   # list of tokens for one sentence
    stringVal = ' '.join(value).lower()          # back to a single lower-cased string
    listFinal.append(stringVal)

It worked for me with impressive results. 💯
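
For instance, the re-joined strings in listFinal can now go straight into TF-IDF without the earlier error (again, just a sketch assuming scikit-learn):

from sklearn.feature_extraction.text import TfidfVectorizer

# listFinal holds plain strings now, so the vectorizer's .lower() call works
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(listFinal)
print(tfidf_matrix.shape)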

I have shared my own experience in this article. Please share your thoughts if you find anything incorrect here. 🙏

Twitter: @MrTomarOfficial

LinkedIn: https://ie.linkedin.com/in/raghvendra-pratap-singh-tomar
