Texthero in NLP: Simplest way to work on text data using Pandas

Suneel Patel
3 min read · Jul 9, 2020

https://pypi.org/project/texthero/

From Zero to Hero in Natural Language Processing (NLP)

Texthero is a Python toolkit for working with text-based datasets quickly and effortlessly. Built on top of the Pandas library, it aims to be the simplest way to clean and preprocess text data in Python. Texthero has the same expressiveness and power as Pandas and is extensively documented. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the resulting vector space takes just a couple of lines. It is modern and conceived for programmers of the 2020s with little or no knowledge of linguistics.

Texthero includes tools for:

● Preprocessing text data: it offers out-of-the-box solutions and is also flexible enough for custom solutions.

● Natural Language Processing: keyphrase and keyword extraction, and named entity recognition.

● Text representation: TF-IDF, term frequency, and custom word embeddings (wip).

● Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modelling (wip) and interpretation.

● Text visualization: vector space visualization, place localization on maps (wip).

How to Install Texthero Library:

Texthero is free, open source and well documented. To install the library, use the pip command:

pip install texthero

Simple pipeline for text cleaning

>>> import texthero as hero
>>> import pandas as pd
>>> text = "This sèntencé (123 /) needs to [OK!] be cleaned! "
>>> s = pd.Series(text)
>>> s
0 This sèntencé (123 /) needs to [OK!] be cleane...
dtype: object

Remove all digits:

>>> s = hero.remove_digits(s)
>>> s
0 This sèntencé ( /) needs to [OK!] be cleaned!
dtype: object

Remove all types of brackets and their content:

>>> s = hero.remove_brackets(s)
>>> s
0 This sèntencé needs to be cleaned!
dtype: object

Remove diacritics:

>>> s = hero.remove_diacritics(s)
>>> s
0 This sentence needs to be cleaned!
dtype: object

Remove punctuation:

>>> s = hero.remove_punctuation(s)
>>> s
0 This sentence needs to be cleaned
dtype: object

Remove extra whitespace:

>>> s = hero.remove_whitespace(s)
>>> s
0 This sentence needs to be cleaned
dtype: object
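
To make the steps above less of a black box, here is a minimal plain-Python sketch of what each one does conceptually. It uses only the standard library. The function bodies are illustrative assumptions that merely share names with the Texthero helpers; they are not Texthero's actual implementation, which works on whole Pandas Series and may treat edge cases differently.

```python
import re
import string
import unicodedata


def remove_digits(text: str) -> str:
    # Replace standalone runs of digits with a space
    return re.sub(r"\b\d+\b", " ", text)


def remove_brackets(text: str) -> str:
    # Drop (...), [...], {...} and <...> together with their content
    return re.sub(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}|<[^<>]*>", "", text)


def remove_diacritics(text: str) -> str:
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text: str) -> str:
    # Map every ASCII punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_whitespace(text: str) -> str:
    # Collapse runs of whitespace and trim both ends
    return " ".join(text.split())


text = "This sèntencé (123 /) needs to [OK!] be cleaned!   "
cleaned = text
for step in (remove_digits, remove_brackets, remove_diacritics,
             remove_punctuation, remove_whitespace):
    cleaned = step(cleaned)

print(cleaned)
```

Applying the steps in a loop mirrors the idea of a cleaning pipeline: each function takes text in and hands cleaner text to the next one.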

Sometimes we also want to get rid of stop-words:

>>> s = hero.remove_stopwords(s)
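
Stop-word removal can be pictured as a simple filter over the words of a sentence. The sketch below is plain Python with a tiny, made-up `STOPWORDS` set (an assumption for illustration only; real stop-word lists are far longer), so it shows the idea rather than how Texthero implements it.

```python
# Tiny illustrative stop-word list (hypothetical; real lists are much longer)
STOPWORDS = {"a", "an", "the", "to", "be", "is", "of", "and"}


def remove_stopwords(text, stopwords=STOPWORDS):
    # Keep only the words whose lowercase form is not a stop-word
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


sentence = "This sentence needs to be cleaned"
result = remove_stopwords(sentence)
print(result)
```

With this toy list, "to" and "be" are dropped while the content-bearing words stay.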

Example: Text preprocessing, TF-IDF, K-means and visualization

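import texthero as hero
import pandas as pd

# Load the BBC Sport dataset from the Texthero repository
df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

# Clean the raw text, then map each document to a TF-IDF vector
df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)

# Cluster the TF-IDF vectors into 5 groups; cast the labels to str
# so they are treated as categories when plotting
df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

# Reduce the TF-IDF vectors with PCA for plotting
df['pca'] = df['tfidf'].pipe(hero.pca)

# Scatter plot of the PCA coordinates, colored by cluster label
hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")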
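
For intuition about the TF-IDF step, the following sketch computes a textbook TF-IDF weight (term count times log(N / document frequency)) on three made-up documents. The toy documents and the `tfidf_vector` helper are assumptions for illustration; library implementations usually add smoothing and normalization, so the numbers produced by `hero.tfidf` will differ.

```python
import math
from collections import Counter

# Three toy documents (made up for this example)
docs = [
    "football match result",
    "cricket match report",
    "football transfer news",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    # Term frequency times inverse document frequency (no smoothing)
    term_counts = Counter(tokens)
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_counts.items()
    }


vectors = [tfidf_vector(tokens) for tokens in tokenized]

for doc, vec in zip(docs, vectors):
    print(doc, "->", {term: round(weight, 3) for term, weight in vec.items()})
```

Terms that appear in only one document (such as "result") receive a higher weight than terms shared across documents (such as "football" or "match"), which is what makes TF-IDF vectors useful as input to clustering.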
https://pypi.org/project/texthero/

Thanks to the contributors of this amazing library.

Contributors are (in chronological order)

  • Selim Al Awwa
  • Parth Gandhi
  • Dan Keefe

https://www.linkedin.com/in/suneelpatel/


Written by Suneel Patel

Data Scientist and AIML Engineer with more than 10 years of experience in Data Analysis, BI Analysis, Forecasting, Optimization, NLP, and Statistical Modeling
