Classifier Analysis Example
You are a data steward at a software company that released a new smartphone application. The company wants to understand the public response to the application and the media coverage it receives. The company asks you and your team to analyze social media comments about the application.
You decide to capture data from Twitter feeds that discuss smartphones. You use the Twitter application programming interface (API) to filter the Twitter data stream. You create a data source that contains the Twitter data you want to analyze.
Because the Twitter feeds contain messages in multiple languages, you must identify the language used in each message. You decide to use a Classifier transformation to identify the languages. You create a mapping that identifies the languages in the source data and writes the Twitter messages to English and non-English data targets.
Create the Classifier Mapping
You create a mapping that reads a data source, classifies the languages in the data, and writes the data to targets based on the languages they contain.
The following image shows the mapping in the Developer tool:
The mapping you create contains the following objects:
Object Name | Description |
---|
Read_tweet_user_lang | Data source. Contains the Twitter messages. |
Classifier | Classifier transformation. Identifies the languages used in the Twitter messages. |
Router | Router transformation. Routes the Twitter messages to data target objects according to the languages they contain. |
Write_en_tweets_out | Data target. Contains Twitter messages in English. |
Write_other_tweets_out | Data target. Contains non-English-language Twitter messages. |
Input Data Sample
The following data fragment shows a sample of the Twitter data that you analyze in the mapping:
Twitter Message |
---|
RT @GanaphoneS3: Faltan 10 minutos para la gran rifa de un iPhone 5... |
RT @Clarified: How to Downgrade Your iPhone 4 From iOS 6.x to iOS 5.x (Mac)... |
RT @jerseyjazz: The razor was the iPhone of the early 2000s |
RT @KrissiDevine: Apple Pie that I made for Thanksgiving. http://t.com/s9ImzFxO |
RT @sophieHz: Dan yang punya 2 kupon undian. Masuk dalam kotak undian yang berhadiah Samsung |
RT @IsabelFreitas: o galaxy tem isso isso isso e a bateria à melhor que do iPhone |
RT @PremiusIpad: Faltan 15 minutos para la gran rifa de un iPhone 5... |
RT @payyton3: I want apple cider |
RT @wiesteronder: Retweet als je iets van Apple, Nike, Adidas of microsoft hebt! |
Data Source Configuration
The data source contains a single port. Each row on the port contains a single Twitter message.
The following table describes the configuration of the data source:
Port Name | Port Type | Precision |
---|
text | n/a | 200 |
Classifier Transformation Configuration
The Classifier transformation uses a single input port and a single output port. The input port reads the text field from the data source. The output port contains the language identified for each Twitter message in the text field. The Classifier transformation identifies each language with a two-letter ISO language code (ISO 639-1), such as en for English.
The following table describes the configuration of the Classifier transformation:
Port Name | Port Type | Precision | Strategy |
---|
text_input | Input | 200 | Classifier1 |
Classifier_Output | Output | 2 | Classifier1 |
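The port contract above (one text value in, one two-letter code out) can be illustrated with a toy function. This is not how the Classifier transformation works internally. It is a minimal sketch that counts stopword matches, and the function name classify_language and the STOPWORDS lists are hypothetical names chosen for this illustration only.

```python
import re

# Hypothetical stopword lists, chosen only for this illustration.
# A production classifier would rely on far more data than this.
STOPWORDS = {
    "en": {"the", "of", "to", "that", "i", "was", "for", "your", "from", "how"},
    "es": {"para", "la", "de", "un", "el", "los", "que", "gran"},
    "nl": {"als", "je", "van", "iets", "het", "een", "hebt"},
}


def classify_language(text):
    """Return the two-letter code whose stopword list matches the most tokens.

    Ties, including the case where nothing matches, fall back to the
    first code listed in STOPWORDS.
    """
    tokens = re.findall(r"[a-záéíóúñ]+", text.lower())
    scores = {
        code: sum(token in words for token in tokens)
        for code, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


# The input plays the role of text_input; the result plays the role of
# Classifier_Output, which has a precision of 2.
sample = "RT @GanaphoneS3: Faltan 10 minutos para la gran rifa de un iPhone 5..."
print(classify_language(sample))  # es
```

The return value always has two characters, which matches the precision of 2 configured on the Classifier_Output port.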
Router Transformation Configuration
The Router transformation uses two input ports. It reads the Twitter messages from the data source and the two-letter language codes from the Classifier transformation. The Router transformation routes the data on the input ports to different output port groups based on a condition that you specify.
The following image shows the Router transformation port groups and port connections:
The following table describes the configuration of the Router transformation:
Port Name | Port Type | Port Group | Precision |
---|
Classifier_Output | Input | Input | 2 |
text | Input | Input | 200 |
Classifier_Output | Output | Default | 2 |
text | Output | Default | 200 |
Classifier_Output | Output | En_Group | 2 |
text | Output | En_Group | 200 |
You configure the transformation to create data streams for English-language messages and for messages in other languages. To create a data stream, add an output port group to the transformation. Use the Groups options on the transformation to add the port group.
To determine how the transformation routes data to each data stream, you define a condition on a port group. The condition identifies a port and specifies a possible value on the port. When the transformation finds an input port value that matches the condition, it routes the input data to the port group that applies the condition.
Define the following condition on the En_Group:
Classifier_Output='en'
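The effect of the condition can be sketched in a few lines of Python. This is an illustration of the routing logic only, not Developer tool syntax. The function name route_rows is hypothetical, and each row is modeled as a dictionary keyed by the port names from the table above.

```python
def route_rows(rows):
    """Split rows into the En_Group and Default output groups.

    A row goes to En_Group when its Classifier_Output value is 'en'.
    Every other row falls through to the Default group.
    """
    groups = {"En_Group": [], "Default": []}
    for row in rows:
        if row["Classifier_Output"] == "en":
            groups["En_Group"].append(row)
        else:
            groups["Default"].append(row)
    return groups


rows = [
    {"Classifier_Output": "en", "text": "RT @payyton3: I want apple cider"},
    {"Classifier_Output": "es", "text": "RT @PremiusIpad: Faltan 15 minutos para la gran rifa de un iPhone 5..."},
]
groups = route_rows(rows)
print(len(groups["En_Group"]), len(groups["Default"]))  # 1 1
```

Because only one condition is defined, the Default group collects every message that is not classified as English, which is why it feeds the Write_other_tweets_out target.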
Note: The Router transformation reads data from two objects in the mapping: the data source and the Classifier transformation. The transformation can combine the data in each output group because it does not alter the row sequence defined in the data objects.
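The note relies on both inputs arriving in the same row order. A minimal Python sketch of that idea follows, assuming the two inputs are simple lists. The function name pair_by_row_order is hypothetical and does not correspond to any product feature.

```python
def pair_by_row_order(messages, codes):
    """Combine two equally long inputs row by row.

    The pairing is only valid when row N of one input describes
    row N of the other, that is, when the row sequence is preserved.
    """
    if len(messages) != len(codes):
        raise ValueError("both inputs must contain the same number of rows")
    return [
        {"text": message, "Classifier_Output": code}
        for message, code in zip(messages, codes)
    ]


messages = [
    "RT @jerseyjazz: The razor was the iPhone of the early 2000s",
    "RT @IsabelFreitas: o galaxy tem isso isso isso e a bateria à melhor que do iPhone",
]
codes = ["en", "pt"]
for row in pair_by_row_order(messages, codes):
    print(row["Classifier_Output"], row["text"])
```

If either input were reordered or filtered before the pairing, a message could be matched with the wrong language code, which is the situation the note rules out.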
Data Target Configuration
The mapping contains a data target for English-language Twitter messages and a data target for messages in other languages. You connect the ports from each Router transformation output group to a data target.
The following table describes the configuration of the data targets:
Port Name | Port Type | Precision |
---|
text | n/a | 200 |
Classifier_Output | n/a | 2 |
Classifier Mapping Outcome
When you run the mapping, the Classifier transformation identifies the language of each Twitter message. The Router transformation writes the message text to data targets based on the language classifications.
The following data fragment shows a sample of the English-language target data:
ISO Language Code | Twitter Message |
---|
en | RT @Clarified: How to Downgrade Your iPhone 4 From iOS 6.x to iOS 5.x (Mac)... |
en | RT @jerseyjazz: The razor was the iPhone of the early 2000s |
en | RT @KrissiDevine: Apple Pie that I made for Thanksgiving. http://t.com/s9ImzFxO |
en | RT @payyton3: I want apple cider |
The following data fragment shows a sample of the target data identified for other languages:
ISO Language Code | Twitter Message |
---|
es | RT @GanaphoneS3: Faltan 10 minutos para la gran rifa de un iPhone 5... |
id | RT @sophieHz: Dan yang punya 2 kupon undian. Masuk dalam kotak undian yang berhadiah Samsung Champ. |
pt | RT @IsabelFreitas: o galaxy tem isso isso isso e a bateria à melhor que do iPhone |
es | RT @PremiusIpad: Faltan 15 minutos para la gran rifa de un iPhone 5... |
nl | RT @wiesteronder: Retweet als je iets van Apple, Nike, Adidas of microsoft hebt! http://t.co/Je6Ts00H |