Amazon product reviews in Hindi

Raghvendra Pratap Singh
3 min read · Jul 23, 2020

Hindi is one of the major Indo-European languages and has been ranked third among the most spoken languages in the world. With 600+ million speakers, it is an enticing language to investigate in the field of Natural Language Processing.

Times of India says that Amazon India has 180 million listed products and 600,000 sellers. Fascinating, right? As a scholar of NLP, I find this eye-opening because I can see lots and lots of data in Hindi. 🤓

What can I do with this data? Sentiment analysis is the obvious answer, but the data has numerous other uses in Natural Language Processing. You’ll discover them with experience, or you may already know them if you’re an expert!

But how do you get the data in Hindi? Thanks to GitHub and the developer of amazon-reviews-scraper, all you need to do is clone the repo and install its requirements with the following commands:

git clone https://github.com/philipperemy/amazon-reviews-scraper.git
cd amazon-reviews-scraper
pip install -r requirements.txt

Now? Open constants.py and update the existing URL to: https://www.amazon.in/

After this? Just navigate to the Amazon website, copy the name of the target product (for example, Chaurasi or Godaan), and run the following command:

python3 amazon-reviews-scraper/amazon_comments_scraper.py -s "Godaan" &>> inputfiles/Godaan.txt

where inputfiles is the directory in which your output will be saved. Create it beforehand, because the shell redirect will not create a missing directory.

Tip: Avoid using special characters, dates, etc. in the name of your product. You may end up with no data!

For example, instead of “Chaurasi / चौरासी / 84”, it would be better to use “Chaurasi”.
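If you have many product titles to tidy up, the tip above can be sketched in a few lines of Python. This is only my own illustration: clean_product_name is a hypothetical helper (it is not part of amazon-reviews-scraper), and it assumes that the part before the first "/" holds the romanised name you want to search for.

```python
import re


def clean_product_name(title: str) -> str:
    """Reduce a listing title to a plain search term (hypothetical helper)."""
    # Keep only the part before the first "/" separator, if there is one.
    first_part = title.split("/")[0]
    # Replace everything except ASCII letters and spaces (digits, dates, punctuation).
    letters_only = re.sub(r"[^A-Za-z ]+", " ", first_part)
    # Collapse repeated spaces and trim both ends.
    return " ".join(letters_only.split())


print(clean_product_name("Chaurasi / चौरासी / 84"))  # prints: Chaurasi
```

Note that this sketch throws away Devanagari and digits on purpose, so it only suits titles that also carry a romanised name.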

Alright, we have done it and got a .txt file. What’s next? Here comes my Python program, which picks out the reviews in Hindi. Prerequisites?

pip install langid
pip install pandas

#!/usr/bin/python3
import sys
import langid
import pandas as pd

# a code reviewsInLanguages.py by Raghvendra Pratap Singh
# M.Sc. student, Dublin City University, Ireland, 2019-20
#
#usage:
#python3 reviewsInLanguages.py <inputfile> <two letter language> <output.csv>
#
#example:
#python3 reviewsInLanguages.py inputfiles/Godaan.txt hi outputfiles/Godaan.csv

# Path of the scraper output file to read
fileValue = sys.argv[1]

# Read all lines; UTF-8 is needed for Devanagari text
with open(fileValue, 'r', encoding='utf-8') as file1:
    Lines = file1.readlines()

# Reviews detected as the requested language
reviews = []
ListOfLanguages = ['af','am','an','ar','as','az','be','bg','bn','br','bs','ca','cs','cy','da','de','dz','el','en','eo','es','et','eu','fa','fi','fo','fr','ga','gl','gu','he','hi','hr','ht','hu','hy','id','is','it','ja','jv','ka','kk','km','kn','ko','ku','ky','la','lb','lo','lt','lv','mg','mk','ml','mn','mr','ms','mt','nb','ne','nl','nn','no','oc','or','pa','pl','ps','pt','qu','ro','ru','rw','se','si','sk','sl','sq','sr','sv','sw','ta','te','th','tl','tr','ug','uk','ur','vi','vo','wa','xh','zh','zu']

if len(sys.argv[2]) == 2:
    if sys.argv[2] in ListOfLanguages:
        for line in Lines:
            # Strip the newline character and skip blank lines
            line = line.strip()
            if not line:
                continue
            # langid.classify returns a (language code, score) tuple
            a = langid.classify(line)
            if a[0] == sys.argv[2]:
                reviews.append(line)
    else:
        print("Please check https://pypi.org/project/langid/1.1dev/ and if your input language is available there, add it to ListOfLanguages")
else:
    print("please enter the language with length of 2 characters")
    sys.exit()


# Write the matching reviews to the output CSV
df = pd.DataFrame(reviews)
df.to_csv(sys.argv[3], encoding='utf-8')

It will create a .csv file for each product you run it on. How to use it? See below:

python3 reviewsInLanguages.py inputfiles/Godaan.txt hi outputfiles/Godaan.csv

That’s it! The output CSV file will contain the reviews for the product you named. However, you’ll have to clean it!
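What does cleaning look like? That depends on your data, so take the following only as a minimal sketch. It assumes the main problems are stray whitespace, empty lines, and exact duplicate reviews; clean_reviews is a hypothetical helper of mine, and it uses only the standard library.

```python
def clean_reviews(lines):
    """Normalize whitespace, drop empty lines, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces.
        text = " ".join(line.split())
        # Skip empty lines and anything already kept.
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


sample = ["बहुत अच्छी किताब है\n", "  बहुत   अच्छी किताब है ", "\n", "कहानी शानदार है\n"]
print(len(clean_reviews(sample)))  # prints: 2
```

The order of the first occurrences is preserved, which is handy if you want to trace a review back to the scraper output.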

Note: A better approach would be to run this command through a scheduler in Linux, such as cron. Also, if you add this command with multiple product names to a job file, it will be the cherry on top.
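One possible shape for such a job file is a small Python script that loops over product names. This is a sketch under assumptions: the script path and the -s flag are taken from the command shown earlier, while build_command, output_path, and scrape_all are hypothetical names of my own. Nothing is scraped until you call scrape_all yourself.

```python
import subprocess
from pathlib import Path

# Script path as used in the scraping command earlier in this article.
SCRAPER = "amazon-reviews-scraper/amazon_comments_scraper.py"


def build_command(product):
    """Return the scraper command for one product as an argument list."""
    return ["python3", SCRAPER, "-s", product]


def output_path(product, out_dir="inputfiles"):
    """Return the text file that collects the scraper output for a product."""
    return Path(out_dir) / f"{product.replace(' ', '_')}.txt"


def scrape_all(products, out_dir="inputfiles"):
    """Run the scraper once per product, appending stdout and stderr to its file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for product in products:
        with open(output_path(product, out_dir), "a", encoding="utf-8") as out:
            # Same effect as the shell redirect `&>>`: both streams are appended to one file.
            subprocess.run(build_command(product), stdout=out,
                           stderr=subprocess.STDOUT, check=False)


print(build_command("Godaan"))
```

A scheduler entry could then run a script that ends with scrape_all(["Godaan", "Chaurasi"]).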

Tip: I would suggest you look into this link. It’s really important. 😎 Also, this command may be useful: pip install fake-useragent

It worked well for me and I’ve collected 2,900+ ‘useful’ reviews in Hindi Language.

Pretty obvious questions:

  1. Can’t we use it in Jupyter notebook? 🤔 Well, you can try and share your experience in the comments. However, I’m not sure if you will be able to install this library there.
  2. Can’t we use it for other languages? 🤔 You certainly can: Kannada, Tamil, Telugu, Bengali, Punjabi… we’re lucky to have many languages. Explore the langid library that I used, or any other library. Sky’s the limit! 😊
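For both questions, it may help to wrap the filtering step in a function that you can call from a notebook with any language code. The sketch below is one way to do it, not the script above: filter_by_language and fake_classify are hypothetical names, and fake_classify is a crude stand-in (any Devanagari character counts as 'hi') that lets the example run without langid. When no classifier is passed, the function falls back to langid.classify, which returns a (language code, score) tuple.

```python
def filter_by_language(lines, lang, classify=None):
    """Keep the non-empty lines whose detected language code equals `lang`."""
    if classify is None:
        # Imported lazily so the function can be tried out without langid installed.
        import langid
        classify = langid.classify
    kept = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        detected, _score = classify(text)
        if detected == lang:
            kept.append(text)
    return kept


def fake_classify(text):
    """Stand-in classifier for the demo: text with Devanagari characters counts as 'hi'."""
    is_devanagari = any("\u0900" <= ch <= "\u097f" for ch in text)
    return ("hi" if is_devanagari else "en", 1.0)


sample = ["Great book\n", "बहुत अच्छी किताब है\n"]
print(len(filter_by_language(sample, "hi", classify=fake_classify)))  # prints: 1
```

With langid installed, filter_by_language(lines, "ta") would keep the lines that langid labels as Tamil, and likewise for any other code in ListOfLanguages.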

I have shared my own experience in this article. Please share your thoughts if you find anything incorrect here.

Twitter: @MrTomarOfficial

LinkedIn: https://ie.linkedin.com/in/raghvendra-pratap-singh-tomar
