As expected, I started out last week by trying to speed up the process of generating text based on larger batches of random vectors and optimize how it was done. It went very successfully, outside of having to run the objective function separately for each sentence. This did speed up the process somewhat, letting there be more iterations, and despite my best hopes, it didn’t help the quality at all. Eventually, I decided to compare how the objective function (which gives a number that demonstrates how “good” a sentence is) rated a terrible generated sentence (generally it would pick a common word, such as “and” or “with”, and repeat it the entire time) versus a preexisting sentence. Unfortunately, it rated the bad sentences extremely well- if I had to guess, it gauges how probable each word is and scores very common words low. This helped explain why all of the sentences were terrible, even if they were noise for an embedded sentence- the objective function itself was flawed.
To remedy that, I made a dataset of “good” sentences (the approximately 400,000 sentences in the training set) and “bad” sentences (approximately the same number of sentences generated from random vectors in the dataset similar to the outputs I wanted to avoid). From there, I made my own objective function- essentially a classifier. This classifier has flaws that I need to fix- I forgot to regularize the outputs, so it’s less of a score that can be improved on and more 0 or 1. Even with a flawed classifier, the results so far have been more promising! The sentences generated using the vectors as noise somewhat replicate the sentence now, which is a plus. I think once that’s been tweaked, things should be even better, especially since this can be batched and speeds the process up.
The other thing I’ve been debating is rebuilding the classifier I use for positive or negative sentiment. Right now, it runs on sentence embeddings, which is ok- but given how little information the embeddings seem to contain if they’re entirely random, I decided it might be better to run it on the tokens, which are pregenerated anyway. Giving them more informaton should hopefully increase the reliability of the network on bad sentences. This also would eliminate some time in the long run, since the sentences would not have to be converted into embeddings. Overall, I expect that instituting some changes might increase the reliability of my network.