Named Entity Recognition (NER) | Custom Advanced NLP Tool using spaCy
Before looking at what Named Entity Recognition (NER) is, it is worth asking why we need it in the first place.
Today's data world is filled with unstructured data such as PDFs, documents, web pages, social media posts, survey feedback and much more. One of the most important tasks for people working with data is finding the best way to extract meaningful information from human-language text, a task known as information extraction.
We therefore need a system that extracts information from textual data for us, so that it can help answer real-world questions and solve real-world problems.
Named entity recognition (NER) is such a system, and it is probably the first step towards information extraction. It attempts to identify named entities in text and classify them into predefined categories such as names of people, organizations, locations, times, prices, monetary values, percentages, products and so on.
What are Named Entities and Named Entity Recognition?
Named-entity recognition (NER) is a method of information extraction that helps us understand the subject or topic of raw text. It is the process of automatically identifying the named entities present in the text of a document and classifying them into predefined categories such as ‘person’, ‘organization’, ‘location’ and so on.
The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities and to classify them. It helps to solve many real-world problems in Natural Language Processing (NLP). NER is also known as entity identification, entity chunking or entity extraction.
Commonly used types of named entity are:
ORGANIZATION: Georgia-Pacific Corp., WHO
PERSON: Eddy Bonte, President Obama
LOCATION: Murray River, Mount Everest
DATE: June, 2008-06-29
TIME: two fifty a m, 1:30 p.m.
MONEY: 175 million Canadian Dollars, GBP 10.40
PERCENT: twenty pct, 18.75 %
FACILITY: Washington Monument, Stonehenge
GPE: South East Asia, Midlothian
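To make "identify and classify" concrete, here is a deliberately naive sketch that tags a few of the example entities listed above by exact lookup, plus a simple pattern for percentages. It is only an illustration of the task, not how spaCy or any statistical NER system works; the names GAZETTEER, PERCENT_RE and toy_ner are hypothetical and introduced just for this example.

```python
import re

# Hypothetical toy gazetteer built from the example entities listed above.
GAZETTEER = {
    "President Obama": "PERSON",
    "Mount Everest": "LOCATION",
    "Washington Monument": "FACILITY",
    "WHO": "ORGANIZATION",
}

# A simple pattern for percentages such as "18.75 %" or "12%".
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s?%")


def toy_ner(text):
    """Return (mention, start, end, label) tuples found by exact lookup."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.group(), match.start(), match.end(), label))
    for match in PERCENT_RE.finditer(text):
        found.append((match.group(), match.start(), match.end(), "PERCENT"))
    # Report mentions in the order they appear in the text.
    return sorted(found, key=lambda item: item[1])


sample = "President Obama visited the Washington Monument; turnout rose 18.75 %."
for mention in toy_ner(sample):
    print(mention)
```

A lookup like this misses any entity that is not in the list and cannot use context to resolve ambiguous names, which is why practical systems rely on trained models.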
NER using spaCy
In this article, we discuss how to build a recognizer that extracts named entities using the spaCy library.
What’s spaCy?
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?
spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
Installation
Using pip
spaCy can be installed with a simple pip install. Using pip, spaCy releases are available as source packages and binary wheels. Before you install spaCy and its dependencies, make sure that your pip, setuptools and wheel are up to date.
pip install -U pip setuptools wheel
pip install -U spacy
Using Conda
spaCy can also be installed with conda, via the conda-forge channel:
conda install -c conda-forge spacy
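After installing, it can be useful to confirm that both spaCy and the English model used in the rest of this article are available. The model en_core_web_sm is distributed as a separate package, usually fetched with python -m spacy download en_core_web_sm. The sketch below uses only the standard library (Python 3.8+); the helper names is_installed and installed_version are hypothetical.

```python
import importlib.util
from importlib import metadata


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status} (version: {installed_version(name)})")
```

If the model is reported as missing, spacy.load('en_core_web_sm') will fail until the model has been downloaded.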
Let’s begin!
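# Load a spaCy model and check whether it has an NER component
# (install the model first: python -m spacy download en_core_web_sm)
import spacy
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)
#> ['tagger', 'parser', 'ner']  (spaCy v2; v3 pipelines list more components, still including 'ner')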
Let’s have a look at how the default NER performs on an article about E-commerce companies.
# Performing NER on E-commerce article
article_text = """India that previously comprised only a handful of players in the e-commerce space, is now home to many biggies and giants battling out with each other to reach the top. This is thanks to the overwhelming internet and smartphone penetration coupled with the ever-increasing digital adoption across the country. These new-age innovations not only gave emerging startups a unique platform to deliver seamless shopping experiences but also provided brick and mortar stores with a level-playing field to begin their online journeys without leaving their offline legacies.
In the wake of so many players coming together on one platform, the Indian e-commerce market is envisioned to reach USD 84 billion in 2021 from USD 24 billion in 2017. Further, with the rate at which internet penetration is increasing, we can expect more and more international retailers coming to India in addition to a large pool of new startups. This, in turn, will provide a major Philip to the organized retail market and boost its share from 12% in 2017 to 22-25% by 2021.
Here’s a view to the e-commerce giants that are dominating India’s online shopping space:
Amazon – One of the uncontested global leaders, Amazon started its journey as a simple online bookstore that gradually expanded its reach to provide a large suite of diversified products including media, furniture, food, and electronics, among others. And now with the launch of Amazon Prime and Amazon Music Limited, it has taken customer experience to a godly level, which will remain undefeatable for a very long time.
Flipkart – Founded in 2007, Flipkart is recognized as the national leader in the Indian e-commerce market. Just like Amazon, it started operating by selling books and then entered other categories such as electronics, fashion, and lifestyle, mobile phones, etc. And now that it has been acquired by Walmart, one of the largest leading platforms of e-commerce in the US, it has also raised its bar of customer offerings in all aspects and giving huge competition to Amazon.
Snapdeal – Started as a daily deals platform in 2010, Snapdeal became a full-fledged online marketplace in 2011 comprising more than 3 lac sellers across India. The platform offers over 30 million products across 800+ diverse categories from over 125,000 regional, national, and international brands and retailers. The Indian e-commerce firm follows a robust strategy to stay at the forefront of innovation and deliver seamless customer offerings to its wide customer base. It has shown great potential for recovery in recent years despite losing Freecharge and Unicommerce.
ShopClues – Another renowned name in the Indian e-commerce industry, ShopClues was founded in July 2011. It’s a Gurugram based company having a current valuation of INR 1.1 billion and is backed by prominent names including Nexus Venture Partners, Tiger Global, and Helion Ventures as its major investors. Presently, the platform comprises more than 5 lac sellers selling products in nine different categories such as computers, cameras, mobiles, etc.
Paytm Mall – To compete with the existing e-commerce giants, Paytm, an online payment system has also launched its online marketplace – Paytm Mall, which offers a wide array of products ranging from men and women fashion to groceries and cosmetics, electronics and home products, and many more. The unique thing about this platform is that it serves as a medium for third parties to sell their products directly through the widely-known app – Paytm.
Reliance Retail – Given Reliance Jio’s disruptive venture in the Indian telecom space along with a solid market presence of Reliance, it is no wonder that Reliance will soon be foraying into retail space. As of now, it has plans to build an e-commerce space that will be established on online-to-offline market program and aim to bring local merchants on board to help them boost their sales and compete with the existing industry leaders.
Big Basket – India’s biggest online supermarket, Big Basket provides a wide variety of imported and gourmet products through two types of delivery services – express delivery and slotted delivery. It also offers pre-cut fruits along with a long list of beverages including fresh juices, cold drinks, hot teas, etc. Moreover, it not only provides farm-fresh products but also ensures that the farmer gets better prices.
Grofers – One of the leading e-commerce players in the grocery segment, Grofers started its operations in 2013 and has reached overwhelming heights in the last 5 years. Its wide range of products includes atta, milk, oil, daily need products, vegetables, dairy products, juices, beverages, among others. With its growing reach across India, it has become one of the favorite supermarkets for Indian consumers who want to shop grocery items from the comforts of their homes.
Digital Mall of Asia – Going live in 2020, Digital Mall of Asia is a very unique concept coined by the founders of Yokeasia Malls. It is designed to provide an immersive digital space equipped with multiple visual and sensory elements to sellers and shoppers. It will also give retailers exclusive rights to sell a particular product category or brand in their respective cities. What makes it unique is its zero-commission model enabling retailers to pay only a fixed amount of monthly rental instead of paying commissions. With its one-of-a-kind features, DMA is expected to bring
never-seen transformation to the current e-commerce ecosystem while addressing all the existing e-commerce worries such as counterfeiting. """
Doc objects in spaCy expose an ents property. You can use it to extract named entities:
doc = nlp(article_text)
for ent in doc.ents:
    print(ent.text, ent.label_)
#> India GPE
#> one CARDINAL
#> Indian NORP
#> USD 84 billion MONEY
#> 2021 DATE
#> USD 24 billion MONEY
#> 2017 DATE
#> India GPE
#> Philip PERSON
#> 12% PERCENT
#> 2017 DATE
#> 22-25% PERCENT
#> 2021 DATE
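A flat list of entities is often easier to read once it is grouped by label. The sketch below does this for a few (text, label) pairs copied from the output above, so it runs without a model; the helper name group_by_label is hypothetical.

```python
from collections import defaultdict


def group_by_label(entities):
    """Map each label to the distinct entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# A few (text, label) pairs copied from the output above. With a real Doc,
# the same list could be built as [(ent.text, ent.label_) for ent in doc.ents].
entities = [
    ("India", "GPE"),
    ("USD 84 billion", "MONEY"),
    ("2021", "DATE"),
    ("USD 24 billion", "MONEY"),
    ("2017", "DATE"),
    ("India", "GPE"),
    ("Philip", "PERSON"),
]

for label, texts in group_by_label(entities).items():
    print(label, texts)
```

Grouping also makes mistakes easier to spot: "Philip" is tagged as a PERSON, but in the article it appears to be a misspelling of "fillip", so the default model should not be expected to be right on every mention.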
Another Example:
>>> piano_class_text = ('Great Piano Academy is situated'
... ' in Mayfair or the City of London and has'
... ' world-class piano instructors.')
>>> piano_class_doc = nlp(piano_class_text)
>>> for ent in piano_class_doc.ents:
...     print(ent.text, ent.start_char, ent.end_char,
...           ent.label_, spacy.explain(ent.label_))
...
Great Piano Academy 0 19 ORG Companies, agencies, institutions, etc.
Mayfair 35 42 GPE Countries, cities, states
the City of London 46 64 GPE Countries, cities, states
In the above example, ent is a Span object with various attributes:
text gives the Unicode text representation of the entity.
start_char denotes the character offset for the start of the entity.
end_char denotes the character offset for the end of the entity.
label_ gives the label of the entity.
spacy.explain gives descriptive details about an entity label. The spaCy model has a pre-trained list of entity classes. You can use displaCy to visualize these entities:
>>> from spacy import displacy
>>> displacy.serve(piano_class_doc, style='ent')
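The start_char and end_char values reported for each entity are plain character offsets into the input string, so slicing the original text with them recovers the entity text. The short check below uses the offsets copied from the piano academy output and does not need a model; the variable name reported is hypothetical.

```python
piano_class_text = ('Great Piano Academy is situated'
                    ' in Mayfair or the City of London and has'
                    ' world-class piano instructors.')

# (start_char, end_char, label) triples copied from the output above.
reported = [(0, 19, 'ORG'), (35, 42, 'GPE'), (46, 64, 'GPE')]

for start, end, label in reported:
    # end_char is exclusive, like the end of a Python slice.
    print(repr(piano_class_text[start:end]), label)
```

This is handy when entity positions need to be stored next to the raw text, for example to highlight matches in a user interface.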
Conclusion
I hope you now understand how to extract named entities from text with spaCy's pre-trained NER model. Thanks for reading! 😊