term_frequency_feature_extractor
Module Contents
Classes
TermFrequencyFeatureExtractor | Build a dataframe with a distribution of term frequencies
- class term_frequency_feature_extractor.TermFrequencyFeatureExtractor(n_bins: int = 40)
Bases:
src.feature_extractors.base_extractor.BaseExtractor
Build a dataframe with a distribution of term frequencies
- Usage:
>>> import catboost
>>> import pandas as pd
>>> from term_frequency_feature_extractor import TermFrequencyFeatureExtractor
>>> data = pd.read_csv("data/raw/train.csv").set_index("text_id")
>>> extractor = TermFrequencyFeatureExtractor()
>>> X = extractor.generate_features(data.full_text)
>>> y = data["vocabulary"]
>>> model = catboost.CatBoostRegressor()
>>> model.fit(X, y)
- Possible improvements:
Add word corrections: triying -> trying
Count not only word frequencies, but also the number of unique words in each histogram bin
- MAX_TERM_FREQUENCY = 23135751162
- generate_features(data: pandas.Series) → pandas.DataFrame
Extracts features from the text in the form of a histogram of word frequencies.
The logarithm is applied to the frequencies to bring their distribution closer to normal.
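The idea of a log-frequency histogram can be illustrated with a minimal sketch. It is not the class's actual implementation: the function name `log_frequency_histogram`, the `TOY_TERM_COUNTS` table, the whitespace tokenisation, the treatment of unknown words, and the column names are all assumptions made for illustration. Only `n_bins` and `MAX_TERM_FREQUENCY` come from the documentation above.

```python
import numpy as np
import pandas as pd

MAX_TERM_FREQUENCY = 23135751162  # constant documented on the class

# Hypothetical corpus counts; a real extractor would load a full frequency table.
TOY_TERM_COUNTS = {
    "the": 23135751162,
    "on": 3000000000,
    "cat": 50000000,
    "sat": 20000000,
    "mat": 5000000,
}


def log_frequency_histogram(data: pd.Series, n_bins: int = 40) -> pd.DataFrame:
    """Histogram of log term frequencies for each text (illustrative only)."""
    # Equal-width bins on the log scale, from log(1) = 0 to log(MAX_TERM_FREQUENCY).
    bin_edges = np.linspace(0.0, np.log(MAX_TERM_FREQUENCY), n_bins + 1)
    rows = []
    for text in data:
        # Naive whitespace tokenisation; unknown words are assumed to have count 1.
        counts = [TOY_TERM_COUNTS.get(token, 1) for token in text.lower().split()]
        # Clip so that rounding never pushes a value outside the bin range.
        log_counts = np.clip(np.log(counts), bin_edges[0], bin_edges[-1])
        hist, _ = np.histogram(log_counts, bins=bin_edges)
        rows.append(hist)
    columns = [f"term_freq_bin_{i}" for i in range(n_bins)]
    return pd.DataFrame(rows, index=data.index, columns=columns)


data = pd.Series(["The cat sat on the mat", "zzz"], index=["a", "b"])
features = log_frequency_histogram(data)
```

Each row of `features` counts how many words of the corresponding text fall into each log-frequency bin, so rare words accumulate in the low bins and very common words in the high bins.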