Sentiment Analysis of Hindi-English CodeMix Language

Vivek Gupta

16 min readMay 14, 2019

SECTION 1

RESEARCH METHODOLOGY

PIPELINE

1. Pre-Processing
2. Feature Engineering
3. Algorithm
4. Accuracy

1. Pre-Processing

This is the common step of every feature model we have discussed below in detail. Preprocessing is done to remove any sort of unwanted or useless words from the sentences chosen one at a time, to prepare a set of features.

Steps involved

Handle emojis
Coded emojis
Preprocess the text
Tokenize
Remove non alphabetic tokens
Remove stop words

We will discuss every important step below:

Handle emojis: This is the first function in the process. The motivation behind this is to handle all the emoji in the text taken one at a time.

Emoji Types:
# Smile — :), : ), :-), (:, ( :, (-:, :’)
# Laugh — :D, : D, :-D, xD, x-D, XD, X-D, :-d, :d
# Love — ❤, :*
# Wink — ;-), ;), ;-D, ;D, (;, (-;
# Sad — :-(, : (, :(, ):, )-:, -_-
# Cry — :,(, :’(, :”(
# Shout — :@

Coded emojis: This functions specifically handle the unicode emoji in the text taken one at a time.

We mapped all the unicodes with their respective nature of emoticon.
Sad = EMONEG
Neutral = EMONEU
Happy = EMOPOS
Example:
\ud83c[\udf80-\udf82] — EMOPOS
\ud83c\udf35 — EMONEG
\ud83c\udffb — EMONEU

Pre-process Text: Here in this step we are trying to do cleaning operations such as:

1. Remove punctuation
2. Convert more than 2 letter repetitions to 2 letter
3. Remove — & ‘
4. Replaces #hashtag with hashtag
5. Replace 1+ dots with space
6. Strip space, “ and ‘ from text
7. Replace @handle with the word USER_MENTION
8. Replaces URLs with the word URL
9. Replace multiple spaces with a single space
10. removing http/url

Tokenize: After above steps, we tokenize the sentence using RegexpTokenizer. This step gives us the tokens of the sentence chosen one at a time.
Remove non alphabetic tokens: Under this step we will take one token at a time from the above to check and remove if the token is not alphabetic in nature.
Remove stop words: We also remove all the stopwords from the tokens to get final tokens of the sentence.

At the end of the pre-processing, we save all the tokens which are extracted form the sentences chosen one at a time for the whole training data into a three different files.

We performed the above operations in three conditions:

1. If the sentiment value in training data is Positive
2. If the sentiment value in training data is Neutral
3. If the sentiment value in training data is Negative

If the sentiment value in training data is Positive:

We save the list of generated tokens in a file labelled positive feature file.

If the sentiment value in training data is Neutral:

We save the list of generated tokens in a file labelled neutral feature file.

If the sentiment value in training data is Negative:

We save the list of generated tokens in a file labelled negative feature file.

By the end of training data, we have three different feature files i.e. Positive, Neutral and Negative.

Final Step

We chose all the generated three files one by one and remove all the redundant words present in the file.

We are now ready with our feature files.

gitvivekgupta/Sentiment-Analysis

Sentiment Analysis for Indian Languages. Contribute to gitvivekgupta/Sentiment-Analysis development by creating an…

github.com

2. Feature Engineering

In this step we will generate feature values to feed into the algorithm for next section.

We have developed four different models as a part of feature engineering process.

We will discuss every model in detail below.

⏩ One Feature Model [MODEL-1]

This model is our first approach towards the objective.

In this model we propose a sentence level score to finally generate a one feature value, which is then used by algorithm at later stage.

STEPS

Remove common words from all the three files and prepare sets of words with positive, negative and neutral natures.
Select one sentence at a time and tokenize it.
Check for every token, where in sets above it appears and assign them a score.
If not present, look for EMOPOS, EMONEG and EMONEU.
If not present, find the token in annotations which gives us the information about token being of Hinglish or English nature.
If English, find word level sentiment score and accordingly assign them a score value as done in STEP 3.
If Hinglish, check in the dictionaries prepared and once we get a final English conversion token repeat STEP 6.
Save the final score with the labelled sentiment in a file.
Reset final score value and REPEAT from 2.

ALGORITHM

gitvivekgupta/Sentiment-Analysis

Sentiment Analysis for Indian Languages. Contribute to gitvivekgupta/Sentiment-Analysis development by creating an…

github.com

⏩ Three Feature Model [MODEL-2]

This model is our second approach towards the objective.
In this model we propose three different sentence level scores to finally generate three feature values, which is then used by algorithm at later stage.

STEPS

Remove common words from all the three files and prepare sets of words with positive, negative and neutral natures.
Select one sentence at a time and tokenize it.
Check for every token, where in sets above it appears and assign them a score values according to predefined pos_score, neu_score and neg_score.
If not present, look for EMOPOS, EMONEG and EMONEU.
If not present, find the token in annotations which gives us the information about token being of Hinglish or English nature.
If English, find word level sentiment score and accordingly assign them a score value as done in STEP 3.
If Hinglish, check in the dictionaries prepared and once we get a final English conversion token repeat STEP 6.
Save the final scores with the labelled sentiment in a file.
Reset all scores and REPEAT form STEP 2.

ALGORITHM

gitvivekgupta/Sentiment-Analysis

Sentiment Analysis for Indian Languages. Contribute to gitvivekgupta/Sentiment-Analysis development by creating an…

github.com

⏩ Four Feature Model [MODEL-3]

This model is our third approach towards the objective.
In this model we propose four different sentence level scores to finally generate three feature value, which is then used by algorithm at later stage.

STEPS

Remove common words from all the three files and prepare sets of words with positive, negative and neutral natures.
Select one sentence at a time and tokenize it.
Check for every token, where in sets above it appears and assign them a score values according to predefined pos_score, neu_score and neg_score. Here we also compute a “sub_score” value as shown in algorithm.
If not present, look for EMOPOS, EMONEG and EMONEU.
If not present, find the token in annotations which gives us the information
about token being of Hinglish or English nature.
If English, find word level sentiment score and accordingly assign them a
score value as done in STEP 3.
If Hinglish, check in the dictionaries prepared and once we get a final English conversion token repeat STEP 6.
Save the final scores with the labelled sentiment in a file.
Reset all scores and REPEAT from STEP 2.

ALGORITHM

gitvivekgupta/Sentiment-Analysis

Sentiment Analysis for Indian Languages. Contribute to gitvivekgupta/Sentiment-Analysis development by creating an…

github.com

⏩ Six Feature Model [MODEL-4]

This model is our fourth and final approach towards the objective.
In this model we propose four different sentence level scores to finally generate three feature value, which is then used by algorithm at later stage.

STEPS

Remove common words from all the three files and prepare sets of words with positive, negative and neutral natures.
Select one sentence at a time and tokenize it.
Check for every token, where in sets above it appears and assign them a score values according to predefined pos_score, neu_score and neg_score. Here we also compute a “pos_sub_score”, “neu_sub_score” and “neg_sub_score” value as shown in algorithm.
If not present, look for EMOPOS, EMONEG and EMONEU.
If not present, find the token in annotations which gives us the information
about token being of Hinglish or English nature.
If English, find word level sentiment score and accordingly assign them a
score value as done in STEP 3.
If Hinglish, check in the dictionaries prepared and once we get a final English
conversion token repeat STEP 6.
Save the final scores with the labelled sentiment in a file.
Reset all scores to zero and REPEAT from STEP 2.

ALGORITHM

gitvivekgupta/Sentiment-Analysis

Sentiment Analysis for Indian Languages. Contribute to gitvivekgupta/Sentiment-Analysis development by creating an…

github.com

3. RESULTS AND DISCUSSIONS

In this section we will evaluate our results based on the outputs of above processes.
This is the last stage of our pipeline where we deploy our algorithms to generate classification among positive, neutral and negative sentences.

⏩ One Feature Model

Here, we consider our first model in which we had computed a single sentence level score [“sent_score”].

After doing all the above operations, pre-processing and feature engineering we will get a file as shown below.

It will look something like this:

Here, we can see, we have a score and a value for every sentence in training data.

4. ALGORITHMS

Here, we will discuss results of different algorithms.

⏩ One Feature Model

A. K-Nearest Neighbors

STEPS

1. Select the file as shown above.
2. Read the score and value columns.
3. Apply split procedure
4. Apply KNN Algorithm over different values of k
5. Plot score value at each k
6. Cross-Validate the results

After applying all the steps as mentioned above we observed the following plot as expected in STEP 5.

As per the above observation, Accuracy is: 45.76% highest, at k = 52

We also did normalization on the above feature data file we had big features values, it was done in order to account for any imbalance caused due to large values in feature data file.

We divided every feature value i.e. the score in above feature data file by 100 and ran the same process again.

Our observation shows that there was no great effect on the output, but a slight fall in accuracy was derived.

Accuracy is: 45.73% highest, at k = 52

Cross Validation and Hyper Parameter Tuning

This is the last step of the analysis were we evaluate our results so as to avoid over-fitting in the model and also to find the optimal value of k.

The over-fitting is a flaw in machine learning systems that is caused by the memorization of training data producing poor results for unseen test data i.e. model is poor to generalize [25].

We performed 10 fold cross validation, the most general approach for evaluating the systems.
We observed, the error is minimum at k = 129, and the accuracy is 45.42%
We can say that our model marginally overfits but the effect is negligible.
We also ran the same cross validation on normalized feature data file.
We observed, the error is minimum at k = 79, and the accuracy is 44.89%
As the value of k reduces, it signifies that the model is more complex in nature and hence it is a better choice to not to normalize the feature data file values.

⏩ Three Feature Model

Here, we consider our first model in which we had computed a three different sentence level scores [“pos_score”, “neu_score” and “neg_score”].

After doing all the above operations, pre-processing and feature engineering we will get a file as shown below.

It will look something like this:

Here, we can see, we have a score and a value for every sentence in training data.

A. K-Nearest Neighbors

STEPS

1. Select the file as shown above.
2. Read the score and value columns.
3. Apply split procedure
4. Apply KNN Algorithm over different values of k
5. Plot score value at each k
6. Cross-Validate the results

After applying all the steps as mentioned above we observed the following
plot as expected in STEP 5.

As per the above observation, Accuracy is: 48.1% highest, at k = 359

We also did normalization on the above feature data file we had big features values, it was done in order to account for any imbalance caused due to large values in feature data file.

We divided every feature value i.e. the score in above feature data file by 100 and ran the same process again.

Our observation shows that there was no great effect on the output, but a slight
fall in accuracy was derived.

Accuracy is: 48.0% highest, at k = 244

Cross Validation and Hyper Parameter Tuning

This is the last step of the analysis were we evaluate our results so as to avoid over-fitting in the model and also to find the optimal value of k.

We performed 10 fold cross validation, the most general approach for
evaluating the systems.

We observed, the error is minimum at k = 151, and the accuracy is 47.40%
We can say that our model marginally over-fits but the effect is negligible.
We also ran the same cross validation on normalized feature data file.

We observed, the error is minimum at k = 273, and the accuracy is 47.80%

In this case we prefer to normalize the feature data file as the value of k reduces, it signifies that the model is more complex in nature and hence it is a better choice to normalize the feature data file values not only for simpler model but also the accuracy gain of .4%

B. Support Vector Machines

We present our observations in the below chart
We have applied different feature scaling techniques and recorded the observations.

Radial Basis Kernel:

The highest accuracy attained is 48.67%

The values of hyper-parameters are set to:
C is set to be in between [1, 100]
Gamma is set to take values of [.001, .01, .1, 1, 2, 3, 4, 5]

Sigmoid Kernel:

The highest accuracy attained is less than the RBF kernel

Linear Kernel:

The highest accuracy attained is less than the RBF kernel

Polynomial Kernel:

The highest accuracy attained is less than the RBF kernel

C. Tree Models

Here, we will discuss results of three tree based algorithms.
We present our observations in below graphs.

Decision Tree

The highest accuracy attained is 48.1%
Our observation shows that, the depth and split parameters in algorithm are:
Split is in between [4000, 5000]
Depth is in between [0, 500]

Random Forest

The highest accuracy attained is 45.91%
It is less than Decision Tree Classifier.

Gradient Boosted Decision Tree

The highest accuracy attained is 48.45%

D. Naive Bayes

We present our observations below

NBC Gaussian: 45.23
NBC Multinomial: 46.10
NBC Bernoulli: 44.15

⏩ Four Feature Model

Here, we consider our first model in which we had computed a three different sentence level scores [“pos_score”, “neu_score”, “neg_score” and “sub_score”].

After doing all the above operations, pre-processing and feature engineering we will get a file as shown below.

It will look something like this:

Here, we can see, we have a score and a value for every sentence in training data.

A. K-Nearest Neighbors

STEPS

1. Select the file as shown above.
2. Read the score and value columns.
3. Apply split procedure
4. Apply KNN Algorithm over different values of k
5. Plot score value at each k
6. Cross-Validate the results

After applying all the steps as mentioned above we observed the following plot as expected in STEP 5.

As per the above observation, Accuracy is: 48.0% highest, at k = 360

We also did normalization on the above feature data file we had big features values, it was done in order to account for any imbalance caused due to large values in feature data file.

We divided every feature value i.e. the score in above feature data file by 100 and ran the same process again.

Our observation shows that there was no great effect on the output, but a slight increase in accuracy was gained.

Accuracy is: 48.1% highest, at k = 360

Cross Validation and Hyper Parameter Tuning

This is the last step of the analysis were we evaluate our results so as to avoid over-fitting in the model and also to find the optimal value of k.

We performed 10 fold cross validation, the most general approach for evaluating the systems.

We observed, the error is minimum at k = 139, and the accuracy is 47.40%
We can say that our model marginally overfits but the effect is negligible.
We also ran the same cross validation on normalized feature data file.

We observed, the error is minimum at k = 139, and the accuracy is 47.40%

Result

B. Support Vector Machines

We present our observations in the below chart
We have applied different feature scaling techniques and recorded the observations.

Radial Basis Kernel:

The highest accuracy attained is 48.39%
The values of hyper parameters are set to:
C is set to be in between [1, 100]
Gamma is set to take values of [.001, .01, .1, 1, 2, 3, 4, 5]

Linear Kernel:

The highest accuracy attained is less than the RBF kernel

Polynomial Kernel:

The highest accuracy attained is less than the RBF kernel

C. Tree Models

Here, we will discuss results of three tree based algorithms.
We present our observations in below graphs.

DECISION TREE

The highest accuracy attained is 47.96%
Our observation shows that, the depth and split parameters in algorithm are:
Split is in between [4000, 5000]
Depth is in between [0, 500]

Random Forest

The highest accuracy attained is 47.03%
It is less than Decision Tree Classifier.

Gradient Boosted Decision Tree

The highest accuracy attained is 49.44%

D. Naive Bayes

We present our observations below

NBC Gaussian: 45.08
NBC Multinomial: 46.16
NBC Bernoulli: 44.58

⏩ Six Feature Model

A. K-Nearest Neighbors

STEPS

1. Select the file as shown above.
2. Read the score and value columns.
3. Apply split procedure
4. Apply KNN Algorithm over different values of k
5. Plot score value at each k
6. Cross-Validate the results

After applying all the steps as mentioned above we observed the following plot as expected in STEP 5.

As per the above observation, Accuracy is: 47.99% highest, at k = 244

We also did normalization on the above feature data file we had big features values, it was done in order to account for any imbalance caused due to large values in feature data file.

We divided every feature value i.e. the score in above feature data file by 100 and ran the same process again.

Our observation shows that there was no great effect on the output, but a slight increase in accuracy was gained.

Accuracy is: 48.0% highest, at k = 244

Cross Validation and Hyper Parameter Tuning

This is the last step of the analysis were we evaluate our results so as to avoid over-fitting in the model and also to find the optimal value of k.

We performed 10 fold cross validation, the most general approach for evaluating the systems.

We observed, the error is minimum at k = 140, and the accuracy is 47.50%
We can say that our model marginally overfits but the effect is negligible.
We also ran the same cross validation on normalized feature data file.
We observed, the error is minimum at k = 139, and the accuracy is 47.50%

B. Support Vector Machines

We present our observations in the below chart
We have applied different feature scaling techniques and recorded the observations.

Radial Basis Kernel:

The highest accuracy attained is 48.94%
The values of hyper parameters are set to:
C is set to be in between [1, 100]
Gamma is set to take values of [.001, .01, .1, 1, 2, 3, 4, 5]

Linear Kernel:

The highest accuracy attained is less than the RBF kernel

Polynomial Kernel:

The highest accuracy attained is less than the RBF kernel

C. Tree Models

Here, we will discuss results of three tree based algorithms.
We present our observations in below graphs.

DECISION TREE

The highest accuracy attained is 47.96%
Our observation shows that, the depth and split parameters in algorithm are:
Split is in between [4000, 6000]
Depth is in between [0, 500]