
Azure Machine Learning Studio: Multiple Language Named Entity Recognition (NER) Text Analysis

Textual analysis is a branch of machine learning that extracts insights from textual data. Examples include sentiment or emotion analysis, which infers a writer's attitude from the tone of the text, and Named Entity Recognition (NER), which identifies people, organizations and locations in the text as separate entities. Machine learning models are written and run with many tools and languages, such as Python, R, Azure Machine Learning Studio and IBM's machine learning tools. Python is one of the most popular languages for writing and running machine learning models.

Today, I will demonstrate how to use the Azure Machine Learning Studio Named Entity Recognition (NER) module to extract people, location and organization entities from a textual dataset in Urdu. Note that the NER module currently supports only English text and recognizes only people, locations and organizations. However, I will show a very simple technique that lets you use the module with any language. I use Urdu as the example here, but you can choose any other language.


The prerequisites for this tutorial are:
  1. Knowledge of Azure Machine Learning Studio.
  2. A free Azure Machine Learning Studio account.
  3. A basic understanding of the machine learning concept of Named Entity Recognition (NER).
  4. Knowledge of writing SQLite queries.
You can download the complete SQLite query source code and the sample pre-processed dataset for this tutorial. The sample dataset comes from the MWaseemRandhawa GitHub account.


Let's begin now. 

1) The Azure Machine Learning Studio Named Entity Recognition (NER) module currently supports English only. To perform NER analysis on non-English text, the first step is therefore to translate the text into English with any suitable translation API, e.g. the Google Translation API or the Bing translation API. I converted my Urdu text dataset into an English text dataset using the Google Translation API.
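The pre-processing in this step can be sketched in Python as follows. This is only an illustration: the translator is a placeholder lookup table rather than a real translation API call, the sample sentences are made up, and the Urdu column name `summery` is an assumption (only `summery_eng` and the article ID appear in this tutorial).

```python
# Placeholder translator: a lookup table stands in for a real translation API.
# Swap in a call to whichever translation service you actually use.
STUB_TRANSLATIONS = {
    "موسم خوشگوار تھا۔": "The weather was pleasant.",
    "علی لاہور سے کراچی گیا۔": "Ali went from Lahore to Karachi.",
}


def translate_to_english(text):
    """Hypothetical stand-in for a translation API call."""
    return STUB_TRANSLATIONS[text]


def preprocess(urdu_texts, translate=translate_to_english):
    """Attach a zero-based article ID and an English translation to each row."""
    return [
        {"article_id": i, "summery": text, "summery_eng": translate(text)}
        for i, text in enumerate(urdu_texts)
    ]


rows = preprocess(list(STUB_TRANSLATIONS))
for row in rows:
    print(row["article_id"], row["summery_eng"])
```

Keeping the original text, the translation and the article ID side by side makes it possible to join the NER results back to the source rows later.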

2) The next step is to import the pre-processed dataset into Azure Machine Learning Studio: log in to your Azure Machine Learning Studio account and import the pre-processed dataset.

3) Now, create a new empty experiment and name it "Multiple Language Named Entity Recognition (NER)".

4) In the module palette, search for your imported dataset and drag and drop it onto the experiment window. Then right click->Dataset->Visualize on the module to view your dataset.

The visualization shows both the original Urdu text and the English translation in my dataset.

5) Now, search for the "Select Columns in Dataset" module and select only the "summery_eng" column, since the NER module is applied to a single text column.

6) Now, search for the "Named Entity Recognition" module and connect the English text column selected in the previous step as its input. Notice that the Named Entity Recognition module does not provide any configuration options.

7) Run the experiment and then visualize the results that the Named Entity Recognition module has produced.

The results show that the Named Entity Recognition module extracts person, location and organization entities from my selected text column. If the text contains multiple entities, each entity is expanded into its own row. Notice that an Article ID is attached to each entity. The Article ID is generated automatically, in the same order as the rows of the provided dataset, and starts at 0. Row 0 of my sample dataset contains no entities, so that row does not appear in the NER output. Row 3 of my dataset contains multiple entities, so each entity gets its own row, but all of them carry the same Article ID. "Offset" is the starting position at which the recognized entity was found, "Length" is the size of the recognized entity, including any spaces, and "Type" is the category of the recognized entity: person, location or organization.
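To make the meaning of the Offset and Length columns concrete, here is a minimal Python sketch. The sample sentences, the entity rows and the type codes (PER, LOC) are hypothetical values chosen for illustration, not output copied from the module.

```python
# Hypothetical input rows; the list index plays the role of the Article ID.
articles = [
    "The weather was pleasant.",             # article 0: no entities
    "Ali Khan flew from Lahore to Karachi.",  # article 1: three entities
]

# Hypothetical NER output rows: (Article, Mention, Offset, Length, Type).
ner_rows = [
    (1, "Ali Khan", 0, 8, "PER"),
    (1, "Lahore", 19, 6, "LOC"),
    (1, "Karachi", 29, 7, "LOC"),
]


def mention_from_span(text, offset, length):
    """Recover an entity from its starting position and its length."""
    return text[offset:offset + length]


recovered = [
    mention_from_span(articles[article], offset, length)
    for article, _mention, offset, length, _type in ner_rows
]
print(recovered)
```

Article 0 has no row in `ner_rows`, which mirrors how a text without entities is left out of the NER result, and the length of "Ali Khan" counts the space between the two words.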

8) In the next step, I want to combine the resulting NER dataset with my existing dataset as a sparse matrix, with one new column for person entities, a second new column for location entities and a third new column for organization entities. To do this, search for the "Apply SQL Transformation" module. In it I have written a SQLite query that groups the entities of each article into a single row by Article ID and splits the Type column into three columns per row. You can download the query from the link provided earlier in this article. Connect the "Apply SQL Transformation" module to the output of the NER module.

In the output, I have separated each row with the "|" symbol and combined each column with the comma "," symbol.
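The author's actual query is in the download and is not reproduced here. As a rough sketch of the same idea, the following Python snippet runs a SQLite query of this kind against an in-memory table. The table name `t1` follows the name the "Apply SQL Transformation" module gives its first input; the column names, the type codes and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (Article INTEGER, Mention TEXT, "
    "Offset INTEGER, Length INTEGER, Type TEXT)"
)
# Hypothetical NER output rows.
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ali Khan", 0, 8, "PER"),
        (1, "Lahore", 19, 6, "LOC"),
        (1, "Karachi", 29, 7, "LOC"),
        (2, "UNICEF", 0, 6, "ORG"),
    ],
)

# One row per article. Each entity becomes "mention,offset,length", and
# several entities of the same type are joined with "|". A CASE without
# ELSE yields NULL, which group_concat skips.
PIVOT_SQL = """
SELECT Article,
       group_concat(CASE WHEN Type = 'PER'
                         THEN Mention || ',' || Offset || ',' || Length END, '|') AS Person,
       group_concat(CASE WHEN Type = 'LOC'
                         THEN Mention || ',' || Offset || ',' || Length END, '|') AS Location,
       group_concat(CASE WHEN Type = 'ORG'
                         THEN Mention || ',' || Offset || ',' || Length END, '|') AS Organization
FROM t1
GROUP BY Article
ORDER BY Article
"""

pivoted = conn.execute(PIVOT_SQL).fetchall()
for row in pivoted:
    print(row)
```

A type with no entities for an article comes back as NULL (`None` in Python), which is what makes the combined result sparse. SQLite does not guarantee the order in which `group_concat` joins the entries within a cell.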

9) Let's combine the input dataset with the resulting dataset to form the sparse matrix. Note that I already attached an Article ID to my input dataset during the pre-processing step. Search for the "Join Data" module, choose a left join (the "Left Outer Join" join type) and combine the two datasets on the Article ID.
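The effect of the left join can be sketched in plain Python. This is not how the "Join Data" module is implemented; it only illustrates the semantics, and the column names and sample rows are hypothetical.

```python
# Hypothetical input dataset, with the article ID attached during pre-processing.
input_rows = [
    {"article_id": 0, "summery_eng": "The weather was pleasant."},
    {"article_id": 1, "summery_eng": "Ali Khan flew from Lahore to Karachi."},
]

# Hypothetical grouped NER result, keyed by article ID.
entities_by_article = {
    1: {
        "Person": "Ali Khan,0,8",
        "Location": "Lahore,19,6|Karachi,29,7",
        "Organization": None,
    },
}

ENTITY_COLUMNS = ("Person", "Location", "Organization")
EMPTY = dict.fromkeys(ENTITY_COLUMNS)  # every entity column set to None


def left_join(left_rows, right_by_key, key="article_id"):
    """Keep every left row; fill entity columns with None when there is no match."""
    return [{**row, **right_by_key.get(row[key], EMPTY)} for row in left_rows]


joined = left_join(input_rows, entities_by_article)
for row in joined:
    print(row)
```

A left join is the natural choice here because rows without any recognized entity, such as article 0, must stay in the final dataset with empty entity columns. An inner join would drop them.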

10) Finally, download the resulting dataset as a CSV file: search for the "Convert to CSV" module, connect it to the output of the "Join Data" module and download your dataset.
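If you build the combined dataset outside the studio, the same kind of CSV file can be written with Python's standard `csv` module. The rows, column names and file name below are hypothetical. The `utf-8-sig` encoding is a choice made for this sketch so that spreadsheet tools are more likely to display the Urdu text correctly.

```python
import csv
import os
import tempfile

# Hypothetical joined rows; empty strings mark missing entities.
joined_rows = [
    {"article_id": 0, "summery": "موسم خوشگوار تھا۔",
     "summery_eng": "The weather was pleasant.",
     "Person": "", "Location": "", "Organization": ""},
    {"article_id": 1, "summery": "علی لاہور سے کراچی گیا۔",
     "summery_eng": "Ali went from Lahore to Karachi.",
     "Person": "Ali,0,3", "Location": "Lahore,14,6|Karachi,24,7",
     "Organization": ""},
]
fieldnames = list(joined_rows[0])

path = os.path.join(tempfile.mkdtemp(), "ner_result.csv")

# newline="" lets the csv module control line endings; cells that contain
# commas are quoted automatically.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(joined_rows)

# Read the file back to confirm that it round-trips.
with open(path, newline="", encoding="utf-8-sig") as f:
    read_back = list(csv.DictReader(f))
print(read_back[1]["Location"])
```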


In this article, you learned a technique for extracting people, location and organization entities from a textual dataset in any language using the Azure Machine Learning Studio Named Entity Recognition (NER) module. You also learned to use the "Apply SQL Transformation" module to reshape the NER results, the "Join Data" module to combine two datasets with a left join, and the "Convert to CSV" module to download the resulting dataset as a CSV file.
