Naive Bayes Algorithm - Challenges in Prediction
Hi,
I have used the ML module to implement the Naive Bayes algorithm for a classification problem similar to sentiment analysis. The independent variables are a word bag and the dependent variable is a classification id. I created a training dataset of around 10K records with 3 classification ids and built the model from it. I then ran the model against the original dataset (400K records, which include the training dataset). I expected the training records to be classified correctly, but when I looked at the results they were not. How do I resolve this issue and improve the accuracy of the predicted classification?
Regards,
Subbu
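A Naive Bayes model is not guaranteed to reproduce the labels of the records it was trained on. The sketch below illustrates this with a small multinomial Naive Bayes written in plain Python (not ECL, so it can be run without a cluster). Everything in it is an assumption made for illustration: the toy word bags, the add-one smoothing, and the helper names `fit` and `predict`. It is not the ML module's implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (word bag, class id). Class 10 is rare and
# shares all of its words with the much larger class 20.
TRAIN = [
    (["refund", "late"], 10),
    (["refund", "late", "order"], 20),
    (["late", "order"], 20),
    (["order", "late"], 20),
    (["invoice", "paid"], 30),
    (["invoice", "paid"], 30),
]


def fit(examples, alpha=1.0):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for words, label in examples:
        word_counts[label].update(words)
    vocab = {w for words, _ in examples for w in words}
    return class_counts, word_counts, vocab, alpha


def predict(model, words):
    """Return the class with the highest log posterior for one word bag."""
    class_counts, word_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total = sum(word_counts[label].values())
        # Log prior plus the sum of smoothed log likelihoods of each word.
        score = math.log(class_counts[label] / n_docs)
        for w in words:
            likelihood = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = fit(TRAIN)
predictions = [predict(model, words) for words, _ in TRAIN]
accuracy = sum(p == label for p, (_, label) in zip(predictions, TRAIN)) / len(TRAIN)
print(predictions, accuracy)
```

In this toy set the single class 10 record is predicted as class 20 even though the model was trained on it, because class 20 has a larger prior and uses the same words. Checking the class balance and the vocabulary overlap between classes is therefore a reasonable first step.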
- kps_mani
- Posts: 24
- Joined: Wed Mar 04, 2015 3:42 pm
Before using your own data, I would run the example ML.Tests.Explanatory.Naive_Bayes.ecl and make sure the model it produces does a good job of classifying the training set (when I ran it, it did).
Then, use the example as a guide to setting up your code to train and classify your own data.
- tlhumphrey2
- Posts: 260
- Joined: Mon May 07, 2012 6:23 pm
Hi,
I could not find ML.Tests.Explanatory.Naive_Bayes.ecl in the ML Beta version of the library. Can you please let me know whether you have the latest version of ML?
Regards,
Subbu
- kps_mani
- Posts: 24
- Joined: Wed Mar 04, 2015 3:42 pm
I shortened the example code, ML.Tests.Explanatory.NaiveBayes.ecl, to the following:
- Code:
IMPORT ML;

// Naive Bayes classifier
trainer := ML.Classify.NaiveBayes();

// Monk dataset - discrete dataset, 124 instances x 6 attributes + class
MonkData := ML.Tests.Explanatory.MonkDS.Train_Data;
ML.ToField(MonkData, fullmds, id);
full_mds := PROJECT(fullmds, TRANSFORM(ML.Types.DiscreteField, SELF := LEFT));
indepDataD := full_mds(number > 1);  // attributes
depDataD := full_mds(number = 1);    // class

// Learning phase
D_Model := trainer.LearnD(indepDataD, depDataD);
dmodel := trainer.Model(D_Model);

// Classification phase
D_classDist := trainer.ClassProbDistribD(indepDataD, D_Model);  // class probability distribution
D_results := trainer.ClassifyD(indepDataD, D_Model);
OUTPUT(D_results);

// Performance metrics: compare the results with the original class
D_compare := ML.Classify.Compare(depDataD, D_results);
OUTPUT(D_compare);
The last line of this code, "OUTPUT(D_compare);", outputs statistics comparing the predicted classes with the training set's dependent data. You should see an accuracy of around 80%.
- tlhumphrey2
- Posts: 260
- Joined: Mon May 07, 2012 6:23 pm
The following shows a tree diagram of ecl-ml. Note where the Explanatory folder is located (under ML, in Tests). Naive_Bayes.ecl is there.
- Code:
+---docs
|   \---images
+---Examples
|   \---Sentilyze
|       +---KeywordCount
|       \---NaiveBayes
+---ML
|   +---DMat
|   +---Docs
|   +---LDA
|   +---Mat
|   +---Regression
|   |   +---Dense
|   |   \---Sparse
|   +---StepRegression
|   +---StepwiseLogistic
|   +---SVM
|   |   \---LibSVM
|   |       \---Test
|   \---Tests
|       +---Benchmarks
|       +---Deprecated
|       +---Explanatory
|       \---Validation
+---PBblas
|   +---BLAS
|   +---Block
|   +---LAPACK
|   \---Tests
+---TS
|   \---Demo
\---VL
    \---XSLT
- tlhumphrey2
- Posts: 260
- Joined: Mon May 07, 2012 6:23 pm
Hi,
Here is the outcome of the model on the training dataset for classification ids 10, 20 and 30:

classifier  c_actual  c_modeled  cnt
1           10        10         33
1           10        20         566
1           10        30         30
1           20        10         37
1           20        20         1660
1           20        30         84
1           30        10         38
1           30        20         49
1           30        30         2196

classifier  c_modeled  precision
1           10         30.55555555555556
1           20         72.96703296703296
1           30         95.06493506493507

classifier  accuracy
1           82.86810142765822

Why am I seeing much lower precision for classification ids 10 and 20 than for classification id 30? Any ideas or suggestions?
Regards,
Subbu
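The precision and accuracy figures in the post can be reproduced from the posted counts, and the same counts also give per-class recall, which the output does not show. The following is a minimal sketch in plain Python (not ECL), assuming only the nine (c_actual, c_modeled, cnt) rows quoted in the post. The variable names are made up for illustration.

```python
from collections import defaultdict

# (c_actual, c_modeled, cnt) rows copied from the post.
confusion = [
    (10, 10, 33), (10, 20, 566), (10, 30, 30),
    (20, 10, 37), (20, 20, 1660), (20, 30, 84),
    (30, 10, 38), (30, 20, 49), (30, 30, 2196),
]

actual_totals = defaultdict(int)   # records that truly belong to each class
modeled_totals = defaultdict(int)  # records the model assigned to each class
correct = defaultdict(int)         # records where actual == modeled

for c_actual, c_modeled, cnt in confusion:
    actual_totals[c_actual] += cnt
    modeled_totals[c_modeled] += cnt
    if c_actual == c_modeled:
        correct[c_actual] += cnt

total = sum(actual_totals.values())
accuracy = 100 * sum(correct.values()) / total
# Precision: of the records predicted as class c, how many really are c.
precision = {c: 100 * correct[c] / modeled_totals[c] for c in sorted(modeled_totals)}
# Recall: of the records that really are class c, how many were predicted as c.
recall = {c: 100 * correct[c] / actual_totals[c] for c in sorted(actual_totals)}

for c in sorted(actual_totals):
    print(c, actual_totals[c], round(precision[c], 2), round(recall[c], 2))
print("accuracy", round(accuracy, 2))
```

On these counts, only 33 of the 629 records whose actual class is 10 are modeled as 10 (recall of about 5%), while 566 of them are modeled as 20. That is also why the precision for class 20 drops to about 73%: most of the records wrongly placed in class 20 come from class 10.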
- kps_mani
- Posts: 24
- Joined: Wed Mar 04, 2015 3:42 pm
I have attached the dependent and independent variables used for the training dataset for your reference. Can you please look at them and comment on why classification is not working properly for classification ids 10 and 20?
Regards,
Subbu
- Attachments
- independenttrainervariable.txt (197.07 KiB)
- dependenttrainervariable.txt (68.75 KiB)
- kps_mani
- Posts: 24
- Joined: Wed Mar 04, 2015 3:42 pm
The 2 files you attached aren't in the format needed for input into the ML classifiers. Attach your ECL code and I will modify it so that the dependent and independent data are in the correct format, then attach the changed code to my next post.
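The example code earlier in the thread filters its field data on "number" (number=1 for the class, number>1 for the attributes), which suggests one row per record id, field number and value. The sketch below shows that reshaping step in plain Python (not ECL). It makes several assumptions: the wide CSV layout and its column names are hypothetical, since the attached files' actual layout is not shown in the thread, and `to_field_triples` is an illustrative helper, not part of the ML module.

```python
import csv
import io

# Hypothetical wide-format sample: one row per record. The first column is
# the record id, the second is the class id, the rest are word-bag counts.
SAMPLE = """id,class_id,w1,w2,w3
1,10,0,2,1
2,30,1,0,0
"""


def to_field_triples(rows, id_column="id"):
    """Turn wide rows into (id, number, value) triples.

    The non-id columns are numbered from 1 in their original order.
    """
    triples = []
    for row in rows:
        rec_id = int(row[id_column])
        value_columns = [c for c in row if c != id_column]
        for number, column in enumerate(value_columns, start=1):
            triples.append((rec_id, number, int(row[column])))
    return triples


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
triples = to_field_triples(rows)

# Split in the same way as the thread's example: field 1 is the class,
# the remaining fields are the independent variables.
dependent = [t for t in triples if t[1] == 1]
independent = [t for t in triples if t[1] > 1]
print(dependent)
print(independent)
```

In ECL, the thread's example does the equivalent conversion with ML.ToField followed by a PROJECT to ML.Types.DiscreteField.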
- tlhumphrey2
- Posts: 260
- Joined: Mon May 07, 2012 6:23 pm
I downloaded them as CSV files, and I believe you will still be able to load them into a dataset and use them directly for ML with the required transformation. Do you still need the ECL code?
Regards,
Subbu
- kps_mani
- Posts: 24
- Joined: Wed Mar 04, 2015 3:42 pm