PIRATES

Conversational AI 

ARC Disaster Response

Scope of work

ARC Conversational AI Cover.jpg

The Skinny

 
Project-overview.png

Overview

This proposal outlines the scope of building the first phase of a conversational AI system for ARC to handle disaster relief request calls.

 

Phase 1 of the conversational AI will take 12 weeks to implement.

The team will include two senior AI engineers, two senior full stack engineers, one QA engineer, and one designer working part time.

The fees will be $225,600.

Scope of project.png

Execution Plan

 
Execution.jpeg

AI Core Development

Developing the AI Core involves four main parts:

  • ETL of evergreen data from call center transcripts

  • Building a language model

  • Entity relationships

  • Conversation flows for each of the three channels

 

The team will be staffed with a mix of data scientists, AI/ML engineers and full stack developers from Pirates' network of startups.

Team.png

Team

 Developing the AI core

AI Bot dev cover.png
  1. Data collection & processing

    Over many years, ARC has built up a knowledge base of responses to the queries that come up during disasters. It has noticed that the responses to most of these queries are similar irrespective of the disaster, and sees an opportunity to automate these responses and delegate only critical or unanswered queries to call center staff.

 

Access and transcribe voice data

Data extract.jpg

The first step would hence be to extract this knowledge base or evergreen data. This step would involve the following sub-steps:

  1. Access the voice chat transcripts

  2. Save the voice chat transcripts in a cloud infrastructure

  3. Transcribe the voice data into text

  4. Evaluate the accuracy of the voice-to-text conversion using measures like Word Accuracy or Word Error Rate (see the sketch below)
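As a point of reference, here is a minimal sketch of how Word Error Rate could be computed for a transcribed call against a human-corrected reference. It is plain Python with no ASR vendor assumed, and the sample sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a standard word-level Levenshtein edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Word Accuracy is simply 1 - WER.
print(word_error_rate("please send a shelter location", "please send shelter locations"))  # 0.4
```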

 

Extract and index evergreen data

Once we are happy with the transcription and the error rates are sufficiently low, we can move on to analysing the text data and extracting a knowledge base from it. Not all of the text data will relate directly to the knowledge base; a large part of it will cover how the staff greet callers or how they word responses based on criticality. All of this is equally important to make sure the chatbots give an experience similar to the call center staff. Once we have identified the different parts of the text data, we will need to tag them and store them in a manner helpful for downstream ML algorithms:

  1. Tag text data by its function, e.g. QUERY, GREETING, SHELTER_RESPONSE. The final list of these annotations will be agreed iteratively with ARC

  2. Index tagged text data so it can be efficiently updated and retrieved by downstream algorithms

  3. Save the tagged and indexed text data in a cloud database (a minimal sketch of such a store follows below)

Indexing.jpg
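A minimal sketch of what the tagged, indexed store could look like. The record fields, tag names and in-memory index are illustrative assumptions; in practice this would live in the cloud database mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Utterance:
    text: str
    tags: list               # e.g. ["QUERY"], ["GREETING"], ["SHELTER_RESPONSE"]
    call_id: str
    metadata: dict = field(default_factory=dict)

class EvergreenIndex:
    """In-memory stand-in for the cloud store: an inverted index from tag to utterances."""

    def __init__(self):
        self._by_tag = defaultdict(list)

    def add(self, utterance: Utterance):
        for tag in utterance.tags:
            self._by_tag[tag].append(utterance)

    def lookup(self, tag: str):
        return self._by_tag.get(tag, [])

index = EvergreenIndex()
index.add(Utterance("Thank you for calling, how can I help?", ["GREETING"], call_id="c-001"))
index.add(Utterance("The nearest open shelter is listed on our website.", ["SHELTER_RESPONSE"], call_id="c-001"))
print([u.text for u in index.lookup("SHELTER_RESPONSE")])
```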
 

Evaluate evergreen data

Data evaluation.jpg

The text we have tagged and indexed should then be evaluated to check whether the basic premise of our project holds. We will have to come up with numbers quantifying what proportion of user queries can be given an automated response. The success of this project will depend on this evaluation step giving satisfactory results. The exact evaluation metrics are hard to pin down now, but simple statistical measures, such as confidence intervals based on Student's t-distribution, should be adequate.
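For illustration only, a minimal sketch of the kind of evaluation we have in mind: estimate, across batches of manually reviewed calls, the share of queries an evergreen response could have answered, and put a Student's-t confidence interval around the mean. The numbers are made up and scipy is an assumed dependency.

```python
import math
from statistics import mean, stdev

from scipy import stats

# Hypothetical numbers: per reviewed batch of calls, the fraction of queries
# that matched an evergreen (automatable) response during manual review.
automatable_share = [0.72, 0.68, 0.75, 0.80, 0.64, 0.71, 0.77, 0.69]

m = mean(automatable_share)
sem = stdev(automatable_share) / math.sqrt(len(automatable_share))

# 95% confidence interval based on Student's t-distribution.
low, high = stats.t.interval(0.95, df=len(automatable_share) - 1, loc=m, scale=sem)
print(f"Estimated automatable share: {m:.2f} (95% CI {low:.2f}-{high:.2f})")
```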

 

2. NLP and language modeling

NLP.png

The queries we get from users might run into the thousands, and there are multiple ways to word the same query in natural language. But while the same sentence can be worded in many different ways, its underlying grammar barely changes. The specific grammar underlying all the queries we extract from the call center data will be the language model for our chatbot(s).

This language model captures the grammar using many different constructs. Some examples include:

 

Important parts of speech and their attributes

As can be seen in the screenshot, for simple queries, just looking at the verb of a sentence encapsulates almost the entire sentence. In addition, many verbs are synonymous, which helps to drastically simplify our grammar. Such approaches also help in indexing the extracted text data and make query responses quicker and more accurate. Looking at parts of speech (like verbs) shields us from the many different ways users express queries in natural language. For example, the object and subject of a verb remain the same irrespective of whether the user uses active or passive voice.

NLP 1.png
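A minimal sketch of this kind of verb-centred analysis, using spaCy's small English model as a stand-in parser (en_core_web_sm is an assumption; any POS tagger with dependency output would do):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_frame(text: str):
    """Return (verb lemma, subjects, objects) for a simple query,
    largely independent of how the user words it."""
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c.text for c in token.children if c.dep_ in ("dobj", "dative", "attr")]
            return token.lemma_, subjects, objects
    return None, [], []

print(verb_frame("I donated some blankets"))        # roughly: ('donate', ['I'], ['blankets'])
print(verb_frame("Where can I find a shelter?"))    # roughly: ('find', ['I'], ['shelter'])
```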
 

Clause trees

NLP 2.png

Complicated sentences are simple clauses held together by conjunctions and prepositions. The problem of understanding longer sentences can therefore be broken down into understanding the simpler clauses and how they are connected by conjunctions and prepositions.
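A small sketch of the idea, again with spaCy as the assumed parser: break a longer sentence into simpler clauses at coordinating conjunctions so each clause can be understood on its own.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def split_clauses(text: str):
    """Very rough clause split: cut the token stream at coordinating
    conjunctions ('and', 'but', 'or') flagged by the dependency parse."""
    doc = nlp(text)
    clauses, current = [], []
    for token in doc:
        if token.dep_ == "cc":      # the conjunction joining two clauses
            if current:
                clauses.append(" ".join(current))
            current = []
        else:
            current.append(token.text)
    if current:
        clauses.append(" ".join(current))
    return clauses

print(split_clauses("My house is flooded and I need a shelter for tonight"))
# roughly: ['My house is flooded', 'I need a shelter for tonight']
```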

 

Vector mapping of words

Similar words appear in similar contexts. Such analysis is crucial for understanding what the user is talking about even if they use words we have not seen before, and it also helps navigate from one context to another. Crucial to building this functionality is a vector representation of the words that appear in our corpus. An example of such a clustering of words (from the corpus of a travel chatbot) as vectors in 2 dimensions is shown below. In the upper right-hand side of the graph, the words "us-visa-waiver-program", "esta", "usa" and "b1-b2-visas" appear together. This is because these words always appear in the same context, i.e. they always have the same neighbouring words; it appears that when people ask about the USA, they are mostly interested in its visa procedures. Similarly, in the upper left-hand side of the graph, the words "london", "public-transport", "buses" and "transportation" appear together, suggesting that when people ask about London, they are more interested in how to get around the city. Knowing the context, and being able to guess it when we encounter words not seen during training, is crucial for chatbots and will immensely help us with the intents we will handle using custom ML models and manual rules (discussed later).

Vector Map.png
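A minimal sketch of how such a vector map could be produced from our own corpus, with gensim's Word2Vec and scikit-learn's PCA assumed for training and for the 2-D projection; the training sentences below are placeholders for the tokenized transcripts.

```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# Placeholder corpus: in practice, the tokenized call center transcripts.
sentences = [
    ["need", "shelter", "near", "houston"],
    ["where", "is", "the", "nearest", "shelter"],
    ["want", "to", "donate", "blankets"],
    ["how", "can", "i", "donate", "money"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Project the learned vectors to 2 dimensions for a scatter plot like the one above.
words = list(model.wv.index_to_key)
coords = PCA(n_components=2).fit_transform(model.wv[words])
for word, (x, y) in zip(words, coords):
    print(f"{word:>10}  ({x:+.2f}, {y:+.2f})")
```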
 

Language model uses

Looking at the extracted text corpus “grammatically” via a language model has various uses:

  1. Helps us index the text corpus the right way so we can retrieve the right responses

  2. Gives us a sense of how complicated a problem we are dealing with 

  3. Draws up an architecture for how we want to structure Intents, Entities and Slots in any slot-based chat system we might use, like DialogFlow or Lex

  4. For the unavoidable shortcomings of cloud offerings like DialogFlow, gives us a very reliable fallback: handling queries with simple manual rules (see the upcoming section)

 

Frameworks and libraries we’ll use

Almost all the functionality needed to build a language model is readily available in local and cloud-based SDKs. Parsers and POS taggers built on resources like the Penn Treebank have been around for decades and provide a high level of accuracy. Google Cloud NLP's syntax analysis API provides POS tagging and dependency information (for building clause trees) with a single API call. We will therefore rely on already available resources and not build any core language model functionality from scratch. The only code we will write invokes these APIs, passes our dataset to them, and evaluates / visualizes the results.
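For instance, a minimal sketch of calling Google Cloud Natural Language's syntax analysis on a single utterance, assuming the google-cloud-language v1 client library and credentials are already set up:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

text = "I want to donate blankets to the shelter in Houston"
document = language_v1.Document(
    content=text, type_=language_v1.Document.Type.PLAIN_TEXT, language="en"
)

# One call returns both the POS tags and the dependency edges we need for clause trees.
response = client.analyze_syntax(
    request={"document": document, "encoding_type": language_v1.EncodingType.UTF8}
)
for token in response.tokens:
    print(
        token.text.content,
        language_v1.PartOfSpeech.Tag(token.part_of_speech.tag).name,
        token.dependency_edge.head_token_index,
        language_v1.DependencyEdge.Label(token.dependency_edge.label).name,
    )
```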

 

3. Building the chatbot entities and flows

DialogFlow.jpeg

The POS analysis helps us understand the intents users are expressing and also the slots that need to be filled for those intents. For example, verbs are usually a very good indicator of the intents we can use in slot-based cloud offerings like DialogFlow. Clubbing synonymous verbs together will greatly reduce the number of intents we have to create and will automatically structure the entities we create in such slot-based systems. Grouping synonymous verbs also helps us collate training utterances for intents and increases the number of training utterances per intent. Using the language model to drive how we structure intents eliminates the following common pitfalls that occur while training slot-based systems:

 

Scope for improvement from analyzing past ARC projects

1. Tagging entire sentences as slot values

Mistake 1.png
 

2. Specifying only 1 slot for an intent (with entire sentences as slot values)

Mistake 2.png
 

3. Not being consistent with tagging slot values and tagging the same value for multiple slots

The first two mistakes stem from trainers getting frustrated when the right intents are not caught. They take shortcuts by tagging entire sentences in the hope of catching the right intent. While such techniques do ensure the intents are caught, they fail to fill the slot values. In the screenshots above, the query "what should i do for volunteering" catches the intent FAQTopicIntent because the entire sentence is tagged as a slot value for FAQ_Subtopic. But now we have no idea that the user is talking about volunteering, because the slot value is "what should i do for volunteering" when it should have been just "volunteering".

 

Improving chatbot accuracy using the language model

DialogFlow, Amazon Lex and similar frameworks are great, but they address a more generic problem. They aim to solve ALL NLU tasks and are hard to customise for smaller, more specific use cases. Because the underlying models of such frameworks are very generic, they need millions of sentences to be trained properly. As discussed above, these frameworks often deliver lower accuracy for smaller use cases, with no way to improve them other than the hacks discussed above. Our strategy of using a language model lets us gauge which intents have a good chance of being caught properly and which do not, even before we start training such systems. For intents where we have good training data, we can rely on the ML models of DialogFlow etc. and avoid the training pitfalls by using automated POS tagging to help us tag slots and assign utterances to intents. The training and response accuracy of these intents can be measured using simple scripts or existing 3rd party services like dashboard.io. For intents where we do not have enough training data or where the evaluation numbers are very low, we can formulate simple manual rules like so:

  • If the VERB is ‘donate’ with no OBJECT

    • Ask the user what they want to donate

  • If the VERB is ‘donate’ with a SUBJECT in the first person and an OBJECT

    • We should ask them to login to pull up their details
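A minimal sketch of how such rules could be expressed on top of a dependency parse, mirroring the two bullets above. spaCy is used as a stand-in parser and the response wording is placeholder copy.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

FIRST_PERSON = {"i", "we"}

def donate_rule(text: str) -> str:
    """Hand-written fallback for the 'donate' intent, for use while there is
    too little training data to hand it to the cloud NLU."""
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "VERB" and token.lemma_ == "donate":
            objects = [c for c in token.children if c.dep_ in ("dobj", "dative")]
            # For "I want to donate ...", the subject hangs off the main verb ("want").
            subj_head = token.head if token.dep_ == "xcomp" else token
            subjects = [c for c in subj_head.children if c.dep_ in ("nsubj", "nsubjpass")]
            if not objects:
                return "What would you like to donate?"
            if subjects and subjects[0].lower_ in FIRST_PERSON:
                return "Please log in so we can pull up your details."
    return "FALL_THROUGH"   # hand the query to the trained NLU or call center staff

print(donate_rule("I would like to donate"))       # no object -> ask what they want to donate
print(donate_rule("I want to donate blankets"))    # first-person subject + object -> ask to log in
```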

Such manual rules can slowly be phased out and moved to DialogFlow once we have built up enough training utterances for them. As our chatbot offerings grow, there will always be a mix of intents served by black-box ML models from cloud systems and intents served by custom ML models or manual rules, with the custom ones gradually graduating to the cloud systems as we build more training data for them.

 

4. Sentiment Analysis

Many cloud services provide ready-made APIs to analyze the sentiment of a "document". In our case, a document will be just 2-3 lines of a user's utterance, while these services rely on documents being large (like an article) to accurately predict sentiment. Also, the definition of sentiment used by such services does not map directly to what we are looking for: they concentrate on whether the sentiment is positive or negative, not on the criticality being expressed.

The first step in building sentiment analysis for our case is to tag the phrases and words that express criticality in our text corpus. We will then analyse these tags to see which parts of speech the critical words occur in. Another signal we would normally ignore is the punctuation at the end of a sentence, for example repeated exclamation marks; many voice-to-text cloud offerings can add punctuation to the converted text, so our call center data is covered as well. The words and phrases tagged as critical, their parts of speech and their neighbouring words, along with the non-critical words, their parts of speech and their neighbouring words, will form the training corpus for sentiment analysis.

With the training data, we will have to pick features to classify a query as critical or not. Examples of such features include:

  1. Number of repeated punctuation marks at the end of a query

  2. TF-IDF weighting of words w.r.t critical or not-critical tags

  3. TF-IDF weighting of neighboring / context words etc… 

This data will be split into training and testing sets using N-fold cross validation, to make sure the ML model is tested on data it has not seen before while a good mix of the data is still utilized for training.
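A minimal sketch of such a classifier, assuming scikit-learn: TF-IDF features (with end-of-sentence punctuation kept as tokens) feeding a logistic regression, scored with k-fold cross validation. The labelled utterances are placeholders for the tagged corpus, and only 3 folds are used because the toy dataset is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder training data: utterances tagged as critical (1) or not critical (0).
texts = [
    "my house is flooding and we are trapped upstairs!!",
    "someone is injured and we need help right now",
    "what are the opening hours of the local chapter",
    "how do i donate blankets",
    "we have no water or power since last night!!",
    "can i volunteer next month",
]
labels = [1, 1, 0, 0, 1, 0]

# TF-IDF over word tokens plus runs of '!' / '?', so repeated punctuation survives as a feature.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[\w']+|[!?]+"),
    LogisticRegression(max_iter=1000),
)

# k-fold cross validation: the classifier is always scored on utterances it was not trained on.
scores = cross_val_score(model, texts, labels, cv=KFold(n_splits=3, shuffle=True, random_state=0))
print("fold accuracies:", scores)
```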

Conversation flows

Conversation flows cover.jpg

IVR

robot-customer-service1.jpg

FB messenger bot

facebook-messenger-bots-logo.jpg

ARC website bot

Chat+bot+conversation+flow.jpg

Once we are able to understand a query and respond to it, we will have to build a conversation flow tailored to each channel. This is where we design a different conversation for each channel while having them share entities, intents and so on. This step is also where we refer to the indexed text corpus to respond to queries; these responses will also vary by channel.

 

Implement custom integrations for the 3 channels

Every channel's custom conversation flow will need to interface with different APIs (3rd party or within ARC). In this step we will complete the conversation flow for each channel by integrating it with these services and fitting the chatbots into the rest of the ARC ecosystem.
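As an illustration, a minimal sketch of a shared fulfillment webhook that serves the same intent with channel-specific wording. Flask, the request fields and the channel names are assumptions for the sketch, not the final integration contract.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Channel-specific wording for the same underlying intent and knowledge-base answer.
SHELTER_RESPONSES = {
    "ivr": "The nearest open shelter is {name}. Press 1 to hear the address again.",
    "facebook": "The nearest open shelter is {name}. Tap the button below for directions.",
    "web": "The nearest open shelter is {name}. See the map on this page for directions.",
}

@app.route("/fulfillment", methods=["POST"])
def fulfillment():
    body = request.get_json()
    intent = body.get("intent")            # e.g. "shelter_lookup", set by the NLU layer
    channel = body.get("channel", "web")   # which of the three channels the query came from

    if intent == "shelter_lookup":
        # In the real flow this would call ARC's Shelter Lookup API.
        shelter = {"name": "Washington High School"}
        text = SHELTER_RESPONSES[channel].format(name=shelter["name"])
        return jsonify({"fulfillmentText": text})

    return jsonify({"fulfillmentText": "Let me connect you to a call center agent."})

if __name__ == "__main__":
    app.run(port=8080)
```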


 ARC Website Bot UX

chatbot-website-examples.jpg

The web chatbot is the only channel that will need a custom UI to be built. Building a UI for such a bot, keeping in mind ARC's current website architecture, involves the following steps:

  1. Research packages that provide a clean chat UI in React

  2. Build the UI for the bot using the selected package

  3. Ensure the UI follows the guidelines provided by ARC

  4. Ensure UI is mobile responsive

  5. Write iFrame code to load the React UI from our backend and make it work with the rest of the HTML website

  6. Write code to show a loading UI while the iFrame loads

 Building for the future

Scaling for future cover.jpg
T-MFL-012-Spanish-Title-Display-Lettering_ver_1.jpg

Support for Spanish (or any new language) will require us to revisit the steps we carried out for English. The following explains those steps and why we will have to revisit them:

  1. Access voice or text data in the new language

  2. Transcribe any voice data to text and evaluate the performance of this transcription

  3. Annotate the text corpus with the tags we came up with for the English corpus. It is unlikely that we will encounter tags used in the English corpus but not in the Spanish corpus, or vice versa; where that happens, we will need to understand why and what the impact could be. For the purposes of this proposal, it is safe to assume that the annotation tags used for English will suffice for Spanish as well.

  4. Build a new language model for the new language. Understanding the new language is essential, so this step is unavoidable, but all the 3rd party services we will use to build the language model support Spanish as well. Our code that interfaces with these services will be written in a language-agnostic way so that both (and any future languages) can share the same codebase.

  5. Create entities, intents and slots for Spanish in DialogFlow (or any other cloud service). This, again, is a limitation of the cloud services: they need to be re-trained for the new language with new training phrases, intents and so on. But as with English, this step will be driven by the language model, which makes it much easier to train for a new language while avoiding the common training pitfalls. We will also keep the same intents for both languages, with similar slots, so that we can tell a user is asking about shelter information whether they ask in English or Spanish.

All integrations with APIs outside the chatbot (whether inside or outside ARC) will be passed a language parameter carrying the user's current language, so that the 3rd party API can respond in a language-specific way.
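A small sketch of that convention, with a hypothetical Shelter Lookup endpoint and parameter names standing in for ARC's real APIs:

```python
import requests

def lookup_shelter(zip_code: str, user_language: str) -> dict:
    """Every outbound API call carries the user's current language so that
    downstream services can respond in a language-specific way."""
    # The URL and parameter names below are hypothetical placeholders.
    response = requests.get(
        "https://api.example.org/shelter-lookup",
        params={"zip": zip_code, "lang": user_language},  # e.g. "en" or "es"
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```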

 

Disaster specific vs generic content

Right from the initial stages of the project, we will make an effort to differentiate between dynamic information that is disaster specific (like shelter information) and static information that is evergreen. We will attempt to fill the dynamic information via APIs like Shelter Lookup and Local Chapter Lookup. Beyond that, we will need to understand what information changes from disaster to disaster in order to scope out how to easily add and remove disaster-specific content.


Execution Plan 

Project Plan Cover.jpg

We anticipate that a team of 2 senior AI engineers and 2 senior full stack engineers will be able to complete phase 1 in about 12-15 weeks. If additional resources are needed to complete the project, Pirates will add them without charging any additional fees to ARC.

Personnel                 | Normal rates/hour | ARC rates/hour | ARC fees/week | ARC fees (12 weeks)
Sr. AI Engineer 1         | $150              | $125           | $5,000        | $60,000
Sr. AI Engineer 2         | $150              | $125           | $5,000        | $60,000
Sr. Full Stack Engineer 1 | $125              | $100           | $4,000        | $48,000
Sr. Full Stack Engineer 2 | $125              | $100           | $4,000        | $48,000
Designer                  | $100              | $80            | $3,200        | $9,600 (3 weeks)
Project Manager           | $125              | comped         | comped        | comped
QA Engineer               | $75               | comped         | comped        | comped
Total                     |                   |                |               | $225,600