Texthero in NLP: Simplest way to work on text data using Pandas

Jul 9, 2020

From Zero to Hero in Natural Language Processing (NLP)

Texthero is a Python toolkit for working with text-based datasets quickly and with little effort. It is meant to be the simplest way to clean and preprocess text data in Python, and it is built on top of the Pandas library. Texthero has the same expressiveness and power as Pandas and is extensively documented. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the resulting vector space takes just a couple of lines. Texthero is modern and designed for programmers of the 2020s who have little or no knowledge of linguistics.

Texthero includes tools for:

● Text preprocessing: it offers out-of-the-box solutions and is also flexible enough for custom ones.

● Natural Language Processing: keyphrase and keyword extraction, and named entity recognition.

● Text representation: TF-IDF, term frequency, and custom word embeddings (wip).

● Vector space analysis: clustering (K-means, Mean shift, DBSCAN and Hierarchical), topic modelling (wip) and interpretation.

● Text visualization: vector space visualization, place localization on maps (wip).

How to Install Texthero Library:

Texthero is free, open source, and well documented. To install the library, use the pip command:

pip install texthero

Simple pipeline for text cleaning

>>> import texthero as hero
>>> import pandas as pd
>>> text = "This sèntencé    (123 /) needs to [OK!] be cleaned!   "
>>> s = pd.Series(text)
>>> s
0    This sèntencé    (123 /) needs to [OK!] be cleane...
dtype: object

Remove all digits:

>>> s = hero.remove_digits(s)
>>> s
0    This sèntencé    (  /) needs to [OK!] be cleaned!
dtype: object

Remove all types of brackets and their content:

>>> s = hero.remove_brackets(s)
>>> s 
0    This sèntencé    needs to  be cleaned!
dtype: object

Remove diacritics:

>>> s = hero.remove_diacritics(s)
>>> s 
0    This sentence    needs to  be cleaned!
dtype: object

Remove punctuation:

>>> s = hero.remove_punctuation(s)
>>> s 
0    This sentence    needs to  be cleaned
dtype: object

Remove extra white-spaces:

>>> s = hero.remove_whitespace(s)
>>> s 
0    This sentence needs to be cleaned
dtype: object
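To make the steps above more concrete, here is a rough plain-Python sketch of what each one does to a single string, using only the standard library. It is not Texthero's implementation: the helper names (`strip_digits`, `strip_brackets`, and so on) are hypothetical, and the regular expressions are assumptions chosen to reproduce the final result shown above.

```python
import re
import string
import unicodedata


def strip_digits(text):
    # Replace standalone runs of digits with a single space.
    return re.sub(r"\b\d+\b", " ", text)


def strip_brackets(text):
    # Drop (), [], {} and <> pairs together with whatever they enclose.
    return re.sub(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}|<[^<>]*>", "", text)


def strip_diacritics(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_punctuation(text):
    # Replace every ASCII punctuation character with a space.
    return re.sub(f"[{re.escape(string.punctuation)}]", " ", text)


def squeeze_whitespace(text):
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def clean_text(text):
    # Apply the steps in the same order as the walkthrough above.
    for step in (strip_digits, strip_brackets, strip_diacritics,
                 strip_punctuation, squeeze_whitespace):
        text = step(text)
    return text


raw = "This sèntencé    (123 /) needs to [OK!] be cleaned!   "
cleaned = clean_text(raw)
print(cleaned)
```

The order matters: brackets are removed before punctuation, because otherwise the bracket characters would already be gone and their content would stay in the text.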

Sometimes we also want to get rid of stop-words, which Texthero supports through hero.remove_stopwords.
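As a minimal illustration of the idea, the sketch below drops stop-words from a single string in plain Python. The function name `drop_stopwords` and the tiny stop-word set are assumptions made for this example; Texthero's own stop-word list and output formatting may differ.

```python
# Tiny illustrative stop-word set (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "to", "be", "is", "this", "of", "and"}


def drop_stopwords(text, stopwords=STOPWORDS):
    # Keep only the tokens whose lowercase form is not a stop-word.
    kept = [token for token in text.split() if token.lower() not in stopwords]
    return " ".join(kept)


result = drop_stopwords("This sentence needs to be cleaned")
print(result)
```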

Example: Text preprocessing, TF-IDF, K-means and visualization

import texthero as hero
import pandas as pd

# Load the BBC Sport news dataset
df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

# Clean the text and compute its TF-IDF representation
df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)

# Cluster the TF-IDF vectors with K-means
df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

# Reduce the vectors with PCA for plotting
df['pca'] = df['tfidf'].pipe(hero.pca)

# Plot the documents, colored by cluster
hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")
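For readers unfamiliar with TF-IDF, here is a small plain-Python sketch of the weighting idea behind the representation step. It uses one common variant (raw term count times `log(N / df) + 1`), which is an assumption: Texthero's tfidf function may use a different formula and normalization. The function name `simple_tfidf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def simple_tfidf(docs):
    # Split each document into whitespace-separated tokens.
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = term count * (log(N / df) + 1); rarer terms score higher.
        vectors.append([
            counts[term] * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term in vocab
        ])
    return vocab, vectors


docs = [
    "football match result",
    "cricket match report",
    "football transfer news",
]
vocab, vectors = simple_tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 2) for v in vec])
```

Terms that occur in only one document (such as "result") receive a higher weight than terms shared across documents (such as "football"), which is what lets a clustering step like K-means separate the documents by topic.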

Thanks to the contributors of this amazing library.

Contributors (in chronological order):

Selim Al Awwa
Parth Gandhi
Dan Keefe

https://www.linkedin.com/in/suneelpatel/

Written by Suneel Patel
