Texthero in NLP: Simplest way to work on text data using Pandas
From Zero to Hero in Natural Language Processing (NLP)
Texthero is a Python toolkit for working with text-based datasets quickly and with little effort. It is built on top of the Pandas library and aims to be the simplest way to clean and preprocess text data in Python. Texthero offers the same expressiveness and power as Pandas and is extensively documented. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the resulting vector space takes just a couple of lines. It is modern and designed for programmers of the 2020s who have little or no background in linguistics.
Texthero includes tools for:
● Text preprocessing: it offers out-of-the-box solutions and is also flexible enough for custom solutions.
● Natural Language Processing: keyphrase and keyword extraction, and named entity recognition.
● Text representation: TF-IDF, term frequency, and custom word embeddings (wip).
● Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modelling (wip) and interpretation.
● Text visualization: vector space visualization, place localization on maps (wip).
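To give a feel for what a TF-IDF representation contains, here is a minimal plain-Python sketch using the textbook formulation (term frequency times the log of inverse document frequency). It is an illustration only: the function name `tfidf_sketch` and the sample documents are made up for this example, and Texthero's own weighting and normalization may differ.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Textbook TF-IDF: tf(t, d) * log(N / df(t)). Not Texthero's implementation."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # One sparse vector (term -> weight) per document.
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical mini-corpus, for illustration only.
docs = [
    "football match ends in draw",
    "tennis match ends early",
    "cricket team wins series",
]
weights = tfidf_sketch(docs)

# "football" occurs in one document, "match" in two,
# so "football" receives the higher weight in the first document.
print(weights[0]["football"] > weights[0]["match"])
```

Terms that are rare across the collection get larger weights, which is why TF-IDF vectors are a common input for clustering.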
How to Install Texthero Library:
Texthero is free, open source, and well documented. To install the library, use the pip command:
pip install texthero
Simple pipeline for text cleaning
>>> import texthero as hero
>>> import pandas as pd
>>> text = "This sèntencé (123 /) needs to [OK!] be cleaned! "
>>> s = pd.Series(text)
>>> s
0 This sèntencé (123 /) needs to [OK!] be cleane...
dtype: object
Remove all digits:
>>> s = hero.remove_digits(s)
>>> s
0 This sèntencé ( /) needs to [OK!] be cleaned!
dtype: object
Remove all types of brackets and their content:
>>> s = hero.remove_brackets(s)
>>> s
0 This sèntencé needs to be cleaned!
dtype: object
Remove diacritics:
>>> s = hero.remove_diacritics(s)
>>> s
0 This sentence needs to be cleaned!
dtype: object
Remove punctuation:
>>> s = hero.remove_punctuation(s)
>>> s
0 This sentence needs to be cleaned
dtype: object
Remove extra whitespace:
>>> s = hero.remove_whitespace(s)
>>> s
0 This sentence needs to be cleaned
dtype: object
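For intuition about what each of the steps above does to a string, here is a rough standard-library sketch of the same five operations. The helper functions are written for this example and are not Texthero's implementation; they only mirror the behaviour shown in the walkthrough.

```python
import re
import string
import unicodedata


def remove_digits(text):
    # Drop every run of digits.
    return re.sub(r"\d+", "", text)


def remove_brackets(text):
    # Drop (...), [...], {...} and <...> together with their content (non-nested).
    return re.sub(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}|<[^<>]*>", "", text)


def remove_diacritics(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text):
    # Replace each ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_whitespace(text):
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def clean_sketch(text):
    # Apply the steps in the same order as the walkthrough.
    steps = (remove_digits, remove_brackets, remove_diacritics,
             remove_punctuation, remove_whitespace)
    for step in steps:
        text = step(text)
    return text


cleaned = clean_sketch("This sèntencé (123 /) needs to [OK!] be cleaned! ")
print(cleaned)  # This sentence needs to be cleaned
```

With Texthero, the same functions operate on a whole Pandas Series at once, so every row of a text column is cleaned in a single call.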
Sometimes we also want to get rid of stop-words:
>>> s = hero.remove_stopwords(s)
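As a concrete illustration of stop-word removal, here is a minimal plain-Python filter. The stop-word set below is a tiny hypothetical list chosen for this example; real libraries ship much longer lists, so their output on the same sentence can differ.

```python
# A tiny, hypothetical stop-word list, for illustration only.
STOPWORDS = {"a", "an", "the", "to", "be", "is", "of", "and"}


def remove_stopwords_sketch(text, stopwords=STOPWORDS):
    # Keep only the words whose lowercase form is not a stop-word.
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


result = remove_stopwords_sketch("This sentence needs to be cleaned")
print(result)  # This sentence needs cleaned
```

Stop-word removal is usually applied after punctuation and whitespace have been normalized, so that each word can be matched cleanly against the list.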
Example: Text preprocessing, TF-IDF, K-means and visualization
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)

df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

df['pca'] = df['tfidf'].pipe(hero.pca)

hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")
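To show roughly what the K-means step in this pipeline does, here is a small plain-Python sketch of the standard assign-and-update loop (Lloyd's algorithm) on two-dimensional points. The function `kmeans_sketch`, the toy points, and the choice of the first k points as starting centroids are all assumptions made for this example; Texthero's `hero.kmeans` is not implemented this way.

```python
def kmeans_sketch(points, k, iterations=10):
    """Lloyd's algorithm on small lists of coordinates (illustration only)."""
    # Assumption for reproducibility: start from the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, point in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(point, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return labels, centroids


# Two well-separated toy groups, made up for this example.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels, centroids = kmeans_sketch(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

In the pipeline above, the points are the TF-IDF vectors of the news articles, and the resulting labels are what colour the PCA scatter plot.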
Thanks to the contributors of this amazing library.
Contributors (in chronological order):
- Selim Al Awwa
- Parth Gandhi
- Dan Keefe