Week 3

This week I finally started coding for my project! Where I left off, I was still looking for datasets to use for sentiment/style transfer. Ultimately, it came down to a choice between a dataset of emails with varying degrees of politeness and a dataset of Yelp reviews (positive versus negative). We agreed on the Yelp dataset: both were about equally accessible, and the Yelp reviews had clearer delineations between positive and negative. (Also, the emails were incredibly dry, and I’d be looking at them for the next eight weeks or so.)

The first thing I did was create a classifier that labels the data as positive or negative. Apparently, to generate the spectrum of texts that I plan to have as the end goal of my project, there’s a package that takes in a classifier and a generator, so building the simpler piece (the classifier) was a good starting point. A key part of this was using sentence embeddings from RoBERTa (a robustly optimized BERT variant) as the inputs to the classifier. From my memories of PyTorch I expected this to be much more difficult, but between the work I did at the beginning of DREU and my class last semester, it was actually surprisingly straightforward. The hardest part was figuring out how to batch tensors, and even that was only an hour or so of looking through documentation.

I got the classifier to around 95% accuracy, then tried to push it higher with Optuna, a hyperparameter optimization package, but it never got much above 95%. Looking at the sentences it misclassified, though, the ceiling seems to be a function of the dataset: the sentences were labeled by the sentiment of the overall review, not the sentiment of each individual sentence. A sentence like “The potato was good, and I ate most of it,” taken from a bad review, is labeled negative even though its own sentiment reads as positive.
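For reference, here’s a minimal sketch of what the setup looks like, assuming Hugging Face’s transformers for RoBERTa; the classifier head, learning rate, and example sentences are placeholders, not my actual configuration:

```python
# Minimal sketch: a binary sentiment classifier over frozen RoBERTa sentence
# embeddings. Model sizes, the learning rate, and the example sentences are
# placeholders, not the real project values.
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")
encoder.eval()  # used purely as a feature extractor here

@torch.no_grad()
def embed(sentences):
    # Tokenize with padding so the batch stacks into a single tensor;
    # this batching step was the part that took the most documentation-reading.
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    # Take the hidden state of the <s> token as the sentence embedding.
    return hidden[:, 0, :]  # shape: (batch, 768)

classifier = nn.Sequential(
    nn.Linear(768, 128),
    nn.ReLU(),
    nn.Linear(128, 2),  # logits for negative / positive
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(sentences, labels):
    logits = classifier(embed(sentences))
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch standing in for Yelp sentences:
train_step(["The potato was good, and I ate most of it.",
            "The service was terrible."],
           torch.tensor([1, 0]))
```

The Optuna piece then just wraps a training run like this in an objective function, using calls like `trial.suggest_float` to propose hyperparameters and returning validation accuracy for `study.optimize` to maximize.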

With that done, the next part I’m focusing on is the LSTM that generates the content. It’s not a traditional LSTM, because it’s supposed to be sequence-to-sequence: the basic idea is to use sentence embeddings as the hidden state and generate the entire sentence at once. I need to look further into how that works, obviously, but I’m hoping that, like the classifier, this will be a fairly straightforward endeavor.
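Based on my current understanding, one plausible shape for this is a decoder LSTM whose initial hidden state is the sentence embedding, unrolled token by token, though the package may well do something different. A rough sketch, with the vocabulary size, dimensions, and greedy decoding loop all as placeholders:

```python
# One possible shape for the generator: an LSTM decoder whose initial hidden
# state is the sentence embedding, unrolled greedily token by token. The
# vocabulary size, dimensions, and decoding loop are all placeholders.
import torch
import torch.nn as nn

class EmbeddingToSentence(nn.Module):
    def __init__(self, vocab_size=30000, hidden_dim=768):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sentence_embedding, max_len=20, bos_id=0):
        # The sentence embedding becomes the LSTM's initial hidden state.
        h = sentence_embedding.unsqueeze(0)      # (1, batch, hidden)
        c = torch.zeros_like(h)
        token = torch.full((sentence_embedding.size(0), 1), bos_id,
                           dtype=torch.long)
        steps = []
        for _ in range(max_len):
            x = self.token_embed(token)          # (batch, 1, hidden)
            out, (h, c) = self.lstm(x, (h, c))
            logits = self.out(out)               # (batch, 1, vocab)
            steps.append(logits)
            token = logits.argmax(dim=-1)        # greedy choice of next token
        return torch.cat(steps, dim=1)           # (batch, max_len, vocab)

decoder = EmbeddingToSentence()
fake_embeddings = torch.randn(2, 768)  # stand-ins for RoBERTa embeddings
print(decoder(fake_embeddings).shape)  # torch.Size([2, 20, 30000])
```

Making the hidden size 768 is what would let a RoBERTa sentence embedding drop in directly as the initial state, which is the connection between this piece and the classifier above.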

Written on June 25, 2021