I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages, so that we can use them to determine which ads should populate on each page. Because this is a classification problem, we'll use Logistic Regression and Bayes models. Misclassifications in this case will be fairly harmless, so I will use the accuracy score and a baseline of 63.3% to rate success. Using TfidfVectorizer, I'll get the feature importance to determine which words have the highest prediction power for the target variables. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.
See the relationship-advice-scrape and dating-advice-scrape notebooks for this part.
After turning all of the scrapes into DataFrames, I saved them as CSVs, which you can find in the dataset folder of the repo.
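As a rough illustration of the scrape-to-DataFrame step, the sketch below flattens one Reddit listing payload into a DataFrame and serializes it as CSV. The function name `listing_to_frame`, the sample payload, and the choice of columns are assumptions made for this example; the actual notebooks may collect different fields.

```python
import pandas as pd


def listing_to_frame(listing: dict) -> pd.DataFrame:
    """Flatten a Reddit listing (already parsed from JSON) into one row per post."""
    rows = [
        {
            "subreddit": child["data"]["subreddit"],
            "title": child["data"]["title"],
            "selftext": child["data"]["selftext"],
        }
        for child in listing["data"]["children"]
    ]
    return pd.DataFrame(rows, columns=["subreddit", "title", "selftext"])


# Hand-written sample shaped like a Reddit listing response (not real scraped data).
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [
            {"kind": "t3", "data": {"subreddit": "dating_advice",
                                    "title": "First date ideas?",
                                    "selftext": "We matched last week."}},
            {"kind": "t3", "data": {"subreddit": "relationship_advice",
                                    "title": "We keep arguing",
                                    "selftext": "Mostly about chores."}},
        ],
    },
}

frame = listing_to_frame(sample_listing)

# With no path argument, to_csv returns the CSV text; pass a file path to write to disk.
csv_text = frame.to_csv(index=False)
```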
Data Cleaning and EDA
- dropped rows with a null selftext column because those rows are useless to me.
- combined the title and selftext columns into one new all_text column
- examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
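The cleaning steps above can be sketched with pandas roughly as follows. The toy rows and the `title_words` / `selftext_words` column names are assumptions for illustration; only `title`, `selftext`, and `all_text` come from the description above.

```python
import pandas as pd

# Toy stand-in for the scraped posts; the real data lives in the repo's CSVs.
posts = pd.DataFrame({
    "subreddit": ["dating_advice", "relationship_advice", "dating_advice"],
    "title": ["First date tips", "Partner won't talk", "Link only"],
    "selftext": ["Where should we go?", "We argued again last night", None],
})

# Drop rows whose selftext is null.
posts = posts.dropna(subset=["selftext"]).reset_index(drop=True)

# Combine title and selftext into a single all_text column.
posts["all_text"] = posts["title"] + " " + posts["selftext"]

# Word counts per post for the title and the selftext (hypothetical column names).
posts["title_words"] = posts["title"].str.split().str.len()
posts["selftext_words"] = posts["selftext"].str.split().str.len()

# Summary statistics of the word counts, split by subreddit, for comparison.
summary = posts.groupby("subreddit")[["title_words", "selftext_words"]].describe()
```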
Preprocessing and Modeling
Found the baseline accuracy score of 0.633, which means that if I always choose the value that occurs most often, I will be right 63.3% of the time.
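A majority-class baseline like this is just the share of the most common label. A minimal sketch, using made-up labels rather than the real target column:

```python
import pandas as pd

# Hypothetical labels for ten posts; which subreddit maps to 1 is an assumption.
y = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])

# Share of each class; the largest share is the baseline accuracy,
# i.e. the score obtained by always predicting the most frequent class.
class_shares = y.value_counts(normalize=True)
baseline = class_shares.max()
```

On the real data, the same calculation is what yields the 0.633 figure quoted above.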
First attempt: logistic regression model with default CountVectorizer parameters. train score: 99% | test: 75% | cross val: 74%
Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scraping; pretty bad score with high variance. Train 99%, test 72%
- tried to decrease max features and the score got a lot worse
- tried lemmatizer preprocessing instead and the test score went up to 74%
Just increasing the data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot. A min_df of 3 and ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores disappeared.
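The setup described here (a stratified split, then CountVectorizer with `min_df=3` and `ngram_range=(1, 2)` feeding a logistic regression) might look roughly like the sketch below. The synthetic corpus, `test_size`, `random_state`, `max_iter`, and `cv=5` are all assumptions chosen so the example runs on its own; they are not taken from the notebooks.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic, trivially separable posts standing in for the scraped text.
dating = [f"how do i ask her on a first date {i}" for i in range(20)]
relationship = [f"my husband and i keep arguing about money {i}" for i in range(20)]
X = dating + relationship
y = [0] * 20 + [1] * 20  # 0 = dating advice, 1 = relationship advice (assumed coding)

# stratify=y keeps the class proportions the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# min_df=3 drops terms seen in fewer than 3 documents;
# ngram_range=(1, 2) counts both single words and bigrams.
model = make_pipeline(
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
```

Comparing `train_score` against `test_score` and `cv_score` is how the overfitting gap discussed in this section shows up.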
I think Tfidf worked the best to decrease my overfitting (high variance) problem because I customized the stop words to take out the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could've tweaked them a bit more to improve all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My original list of features had plenty of gibberish words and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped to remove oversaturated words as well. Finally, setting max features to 5000 cut my columns down to about a quarter of what they were, to focus on just the most frequently used words of what was left.
Conclusion and Recommendations
Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely a few words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see whether different stop words change the results.
Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.