In 2017 I was working on a model trainer for text classification and sequence labeling [1] that had limited success because the models weren't good enough.
I have a minilm + pooling + svm classifier which works pretty well for some things (topics, "will I like this article?") but doesn't work so well for sentiment, emotional tone, and other things where the order of the words matters. I'm planning to upgrade my current classifier's front end to use ModernBert and add an LSTM-based back end that I think will equal or beat fine-tuned BERT and, more importantly, can be trained reliably with early stopping. I'd like to open source the thing, with a focus on reliability, because I'm an application programmer at heart.
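For concreteness, the current setup is roughly the sketch below, assuming sentence-transformers and scikit-learn; the exact MiniLM checkpoint and SVM settings are assumptions, not my real config.

```python
# Rough sketch of the existing minilm + pooling + svm pipeline.
# "all-MiniLM-L6-v2" and the default SVC are stand-ins for whatever I actually use.
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # MiniLM with built-in mean pooling

def train(texts, labels):
    # One pooled vector per document; word order is lost at this step,
    # which is why sentiment and emotional tone suffer.
    X = encoder.encode(texts)
    clf = SVC()
    return clf.fit(X, labels)

def predict(clf, texts):
    return clf.predict(encoder.encode(texts))
```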
I want it to provide an interface which is text-in and labels-out and hide the embeddings from most users, but I'm definitely thinking about how to handle them. There's a worse problem here: the LSTM needs a vector for each token, not each document, so text gets puffed up by a factor of 1000 or so, which is not insurmountable (1 MB of training text puffs up to roughly 1 GB of vectors).
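The kind of back end I have in mind looks something like this; it's an illustrative PyTorch sketch, and the class name, dimensions, and hyperparameters are placeholders rather than a committed design.

```python
# Illustrative LSTM back end over per-token embeddings from a frozen encoder.
# embed_dim=768 matches a typical BERT-family encoder; everything here is a placeholder.
import torch
import torch.nn as nn

class LSTMHead(nn.Module):
    def __init__(self, embed_dim: int = 768, hidden: int = 256, n_labels: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, token_vectors):            # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(token_vectors)     # final hidden state for each direction
        doc = torch.cat([h[-2], h[-1]], dim=-1)  # (batch, 2 * hidden)
        return self.out(doc)                     # logits, (batch, n_labels)
```

The factor of ~1000 roughly checks out: a token is a few bytes of text, but at 768 float32 dimensions its embedding is about 3 KB.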
Since it's expensive to compute the embeddings and expensive to store them, I'm thinking about whether and how to cache them, considering that I expect to present the same samples to the trainer multiple times and to do a lot of model selection, both in the process of model development (e.g. what exact shape of LSTM to use) and in the case of end-user training (it will probably try a few models, not least do a shootout between the expensive model and a cheap model).
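The cache I'm imagining is something like the sketch below: per-token embeddings stored on disk, keyed by a hash of the text plus the encoder name, so repeated presentations and model-selection runs hit disk instead of the encoder. The directory layout and function names are illustrative only.

```python
# Sketch of an on-disk cache for per-token embeddings; names and layout are placeholders.
import hashlib
import pathlib
import numpy as np

CACHE_DIR = pathlib.Path("embedding_cache")

def _cache_path(text: str, encoder_name: str) -> pathlib.Path:
    digest = hashlib.sha256(f"{encoder_name}\x00{text}".encode("utf-8")).hexdigest()
    return CACHE_DIR / encoder_name / f"{digest}.npy"

def token_embeddings(text: str, encoder_name: str, compute) -> np.ndarray:
    """Return the (seq_len, dim) token vectors for text, computing them at most once."""
    path = _cache_path(text, encoder_name)
    if path.exists():
        return np.load(path)
    vectors = compute(text)                  # the expensive encoder call
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, vectors)
    return vectors
```

Keying on the encoder name means swapping the front end (minilm vs. ModernBert) invalidates the cache naturally, and the cheap-vs-expensive shootout can share the same store.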
[1] think of a "magic magic marker" which learns to mark up text the same way you do; this could mark "needless words" you could delete from a title, parts of speech, named entities, etc.