
The Skinny
Overview
This proposal outlines the scope of building the first phase of conversational AI software for ARC to handle disaster relief request calls.
Phase 1 of the conversational AI will take approximately 12 weeks to implement.
The team will include two senior AI engineers, two senior full-stack engineers, one QA engineer and one designer working part time.
The fees will be $225,600.
Execution Plan
AI Core Development
Developing the AI Core involves four main parts:
ETL of evergreen data from call center transcripts
Building a language model
Entity relationships
Conversation flows for each of the three channels
The team will be staffed with a mix of data scientists, AI/ML engineers and full stack developers from Pirates' network of startups.
Team
Developing the AI core

1. Data collection & processing
Over many years, ARC has built up a knowledge base of responses to queries that come up during disasters. It has noticed that responses to most of these queries are similar irrespective of the disaster, and it sees an opportunity to automate these responses and delegate only critical or unanswered queries to call center staff.
Access and transcribe voice data
The first step would hence be to extract this knowledge base or evergreen data. This step would involve the following sub-steps:
Access the voice chat transcripts
Save the voice chat transcripts in a cloud infrastructure
Transcribe the voice data into text
Evaluate the accuracy of the voice-to-text conversion using measures like Word Accuracy or Word Error Rate (a minimal sketch follows this list)
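As a quick illustration of this evaluation, word error rate is essentially an edit distance over words; below is a minimal, self-contained sketch (the transcript snippets in it are made up).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Made-up example: one dropped word out of six -> WER ≈ 0.17
print(word_error_rate("we need shelter for four people",
                      "we need shelter for people"))
```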
Extract and index evergreen data
Once we are happy with the transcription and the error rates are sufficiently low, we can move on to analysing the text data and extracting a knowledge base from it. Not all the text we extract will directly relate to the knowledge base; a large part of it will be about how staff greet callers or how they word responses based on criticality. All of this will be equally important to make sure the chatbots give an experience similar to the call center staff. Once we have identified the different parts of the text data, we will need to tag it and store it in a manner helpful for downstream ML algorithms.
Tag text data by its function, like QUERY, GREETING, SHELTER_RESPONSE etc… The final list of these annotations will be worked out iteratively with ARC (a sketch of such a tagged record follows this list)
Index tagged text data so it can be efficiently updated and retrieved by downstream algorithms
Save the tagged and indexed text data on a cloud database
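For illustration, a tagged record and a simple index keyed by tag might look like the sketch below; the field names and tags are placeholders (the real list will be worked out with ARC), and the in-memory dict stands in for the cloud database mentioned above.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedUtterance:
    """One annotated span from a call center transcript."""
    transcript_id: str
    speaker: str              # "agent" or "caller"
    text: str
    tags: list                # e.g. ["GREETING"] or ["QUERY", "SHELTER_RESPONSE"]
    metadata: dict = field(default_factory=dict)

# Tiny in-memory inverted index by tag; in practice this lives in a cloud
# database / search index so downstream ML algorithms can query and update it.
index = {}

def add_to_index(utterance: TaggedUtterance) -> None:
    for tag in utterance.tags:
        index.setdefault(tag, []).append(utterance)

add_to_index(TaggedUtterance(
    transcript_id="call-001",
    speaker="agent",
    text="The nearest open shelter is at the community center.",
    tags=["SHELTER_RESPONSE"],
))
print(len(index["SHELTER_RESPONSE"]))  # 1
```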
Evaluate evergreen data
The text we have tagged and indexed should then be evaluated to check whether the basic premise of our project holds. We will have to come up with numbers quantifying how many user queries can be given an automated response. The success of this project will depend on this evaluation step giving satisfactory results. The exact evaluation metrics are hard to pin down now, but simple statistical measures based on the Student's t-distribution should be adequate.
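As one possible way to quantify this (illustrative only, with made-up numbers): estimate, per sampled transcript, the fraction of caller queries the evergreen knowledge base could have answered, then report the mean coverage with a t-based confidence interval.

```python
import math
from scipy import stats

# Hypothetical per-transcript coverage: the fraction of caller queries in each
# sampled transcript that the evergreen knowledge base could have answered.
coverage = [0.72, 0.81, 0.65, 0.78, 0.70, 0.74, 0.69, 0.77]

n = len(coverage)
mean = sum(coverage) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in coverage) / (n - 1))
t_crit = stats.t.ppf(0.975, df=n - 1)      # two-sided 95% interval
half_width = t_crit * sd / math.sqrt(n)

print(f"Estimated automatable coverage: {mean:.2f} ± {half_width:.2f} (95% CI)")
```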
2. NLP and language modeling
The queries we get from users might run into the thousands, and there are multiple ways to word the same query in natural language. But while we can word the same sentence in many different ways, its underlying grammar barely changes. The specific grammar underlying all the queries we extract from the call center data will be the language model for our chatbot(s).
This language model captures the grammar using many different constructs. Some examples include:
Important parts of speech and their attributes
As can be seen in the screenshot, for simple queries, just looking at the verb of a sentence encapsulates almost the entire sentence. On top of that, many verbs are synonymous, which helps drastically simplify our grammar. Such approaches also help in indexing the extracted text data and make query responses quicker and more accurate. Looking at parts of speech (like verbs) shields us from the many different ways users express queries in natural language. For example, the object and subject of a verb remain the same irrespective of whether the user uses active or passive voice.
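As a sketch of this idea, a local parser such as spaCy (named here only as one example of readily available tooling) can recover the verb together with its subject and object while normalizing away active vs. passive voice:

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def verb_frame(sentence: str) -> dict:
    """Return the main verb with its subject and object, normalizing passive voice."""
    frame = {"verb": None, "subject": None, "object": None}
    for token in nlp(sentence):
        if token.pos_ == "VERB":
            frame["verb"] = token.lemma_
            for child in token.children:
                if child.dep_ == "nsubj":
                    frame["subject"] = child.text
                elif child.dep_ in ("dobj", "obj"):
                    frame["object"] = child.text
                elif child.dep_ == "nsubjpass":   # passive: grammatical subject is the logical object
                    frame["object"] = child.text
                elif child.dep_ == "agent":       # passive: "by X" carries the logical subject
                    frame["subject"] = next(
                        (g.text for g in child.children if g.dep_ == "pobj"), frame["subject"]
                    )
            break
    return frame

# Active and passive voice yield the same verb / subject / object frame.
print(verb_frame("The volunteers opened the shelter."))
print(verb_frame("The shelter was opened by the volunteers."))
```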
Clause trees
Complicated sentences are simple clauses held together by conjunctions and prepositions. The problem of understanding longer sentences can be broken down into understanding the simpler clauses and how they're connected by conjunctions and prepositions.
Vector mapping of words
Similar words appear in similar contexts. Such analysis is crucial to understanding what the user is talking about even if they use words we haven't seen before. It also helps navigate from one context to another, and crucial to building such functionality is a vector representation of the words that appear in our corpus. An example of such a clustering of words (from the corpus of a travel chatbot) as vectors in 2 dimensions is shown above. In the upper right-hand side of the graph, the words "us-visa-waiver-program", "esta", "usa" and "b1-b2-visas" appear together. This is because such words always appear in the same context, i.e. they always have the same neighbouring words. So it looks like when people ask about the USA, they are interested in its visa procedures. Similarly, in the upper left-hand side of the graph, the words "london", "public-transport", "buses" and "transportation" appear together, so it looks like when people ask about London, they are more interested in knowing how to get around the city. Knowing the context, and guessing it when we encounter words we haven't seen during training, is crucial for chatbots and will immensely help us with the intents we are going to handle using custom ML models and manual rules (discussed later).
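To make the word-vector idea concrete, here is a minimal sketch assuming gensim and scikit-learn (any equivalent embedding tooling would do); the tokenized utterances are placeholders for the extracted call center corpus.

```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# Placeholder tokenized utterances; in practice this is the extracted call center corpus.
sentences = [
    ["need", "shelter", "after", "the", "flood"],
    ["where", "is", "the", "nearest", "shelter"],
    ["how", "do", "i", "donate", "blood"],
    ["i", "want", "to", "donate", "money"],
]

# Words that appear in similar contexts end up with similar vectors.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50, seed=7)
print(model.wv.most_similar("shelter", topn=3))

# Project to 2 dimensions for plotting / visual clustering like the graph described above.
words = list(model.wv.index_to_key)
coords = PCA(n_components=2).fit_transform(model.wv[words])
for word, (x, y) in zip(words, coords):
    print(f"{word:>10}  ({x:+.2f}, {y:+.2f})")
```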
Language model uses
Looking at the extracted text corpus “grammatically” via a language model has various uses:
Helps us index the text corpus the right way so we can retrieve the right responses
Gives us a sense of how complicated a problem we are dealing with
Lets us draw up an architecture for how we want to structure Intents, Entities and Slots in any slot-based chat system we might use, like DialogFlow, Lex etc…
For all the unavoidable shortcomings of cloud offerings like DialogFlow, language models give us a very reliable fallback for handling queries by defining simple manual rules (see upcoming section)
Frameworks and libraries we’ll use
Almost all the functionality needed to build a language model is readily available in many local and cloud-based SDKs. Parsers and POS taggers trained on resources like the Penn Treebank have been around for decades and provide a high level of accuracy. Google Cloud NLP's syntax analysis API provides POS tagging and dependency information (for building clause trees) with a single API call. We will hence rely on already available resources and not build any core language model functionality from scratch. The only code we will write is to invoke these APIs, pass our dataset to them and evaluate / visualize the results.
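For illustration, a single syntax-analysis call with the google-cloud-language client could look like the sketch below; the exact client version and field names may differ, so treat this as a sketch rather than a committed implementation.

```python
from google.cloud import language_v1

def analyze_syntax(text: str) -> None:
    """One API call returning POS tags and dependency edges (for clause trees)."""
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_syntax(request={"document": document})
    for token in response.tokens:
        print(
            token.text.content,
            token.part_of_speech.tag,                # POS tag
            token.dependency_edge.label,             # dependency label
            token.dependency_edge.head_token_index,  # head token, for building clause trees
        )

analyze_syntax("I need shelter because my house was flooded.")
```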
3. Building the chatbot entities and flows
The POS analysis helps us understand the intents users are expressing and also the slots that need to be filled for those intents. For example, verbs are usually a very good indicator of the intents we can use in slot-based cloud offerings like DialogFlow. Grouping synonymous verbs together will greatly reduce the number of intents we have to create and automatically structure the entities we create in such slot-based systems. Such grouping of synonymous verbs also helps us collate training utterances for intents and increase the number of training utterances for each intent. Using the language model to drive how we structure intents eliminates the following common pitfalls that occur while training slot-based systems:
Scope for improvement from analyzing past ARC projects
1. Tagging entire sentences as slot values
2. Specifying only 1 slot for an intent (with entire sentences as slot values)
3. Not being consistent with tagging slot values and tagging the same value for multiple slots
The first two errors stem from users getting frustrated when the right intents are not caught. Users hence take shortcuts by tagging entire sentences in the hope of catching the right intent. While such techniques do ensure intents are caught, they fail to fill slot values. From the screenshots above, the query "what should i do for volunteering" catches the intent FAQTopicIntent because the entire sentence is tagged as a slot value for FAQ_Subtopic. But now we have no idea that the user is talking about volunteering, because the slot value is "what should i do for volunteering" when it should have just been "volunteering".
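To show the volunteering example in data form, the structures below are illustrative only (not an actual DialogFlow or Lex export):

```python
# How the utterance ended up being tagged (pitfalls 1 and 2): the whole sentence
# becomes the slot value, so the intent fires but the slot is useless.
bad_training_example = {
    "utterance": "what should i do for volunteering",
    "intent": "FAQTopicIntent",
    "slots": {"FAQ_Subtopic": "what should i do for volunteering"},
}

# How the language model guides us to tag it: the POS analysis points at
# "volunteering" as the actual topic, so only that span is the slot value.
good_training_example = {
    "utterance": "what should i do for volunteering",
    "intent": "FAQTopicIntent",
    "slots": {"FAQ_Subtopic": "volunteering"},
}
```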
Improving chatbot accuracy using the language model
DialogFlow, Amazon Lex etc… are great frameworks but address a more generic problem. These frameworks aim to solve ALL NLU tasks and are very hard to customise for smaller and more specific use cases. Because the underlying models of such frameworks are very generic, they need millions of sentences to be trained properly. As discussed above, these frameworks often result in lower accuracy for smaller use cases, with no way to improve them other than the hacks discussed above. But our strategy of using a language model allows us to get a good sense of which intents have a good chance of being caught properly and which don't, even before we start training such systems. For intents where we have good training data, we can rely on the ML models of DialogFlow etc… and avoid the pitfalls that occur during training by using automated POS tagging to help us tag slots and assign utterances to intents. The training and response accuracy of these intents can be measured using simple scripts or existing 3rd-party services like dashboard.io. For intents where we do not have enough training data or the evaluation numbers are very low, we can formulate simple manual rules like so:
If the VERB is ‘donate’ with no OBJECT
Ask the user what they want to donate
If the VERB is ‘donate’ with a SUBJECT in the first person and an OBJECT
Ask them to log in so we can pull up their details
Such manual rules can slowly be phased out and moved to DialogFlow once we have built enough training utterances for them. As our chatbot offerings grow, there will always be some intents served by black-box ML models from cloud systems and others by custom ML models / manual rules, with the custom ones gradually progressing to cloud systems as we build more training data for them.
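A sketch of how the 'donate' rules above might be encoded on top of the parsed verb frame; spaCy is used here only as a stand-in for whichever parser backs the language model.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
FIRST_PERSON = {"i", "we"}

def subject_of(verb):
    """Subject of a verb, following the head for controlled clauses like 'I want to donate'."""
    for child in verb.children:
        if child.dep_ in ("nsubj", "nsubjpass"):
            return child
    if verb.dep_ == "xcomp":
        return subject_of(verb.head)
    return None

def donate_rule(utterance: str):
    """Manual fallback rule for the 'donate' intent, driven by the POS analysis."""
    for token in nlp(utterance):
        if token.pos_ == "VERB" and token.lemma_ == "donate":
            has_object = any(c.dep_ in ("dobj", "obj") for c in token.children)
            subject = subject_of(token)
            if not has_object:
                return "What would you like to donate?"
            if subject is not None and subject.text.lower() in FIRST_PERSON:
                return "Please log in so we can pull up your details."
    return None  # not handled by this rule; fall through to the cloud NLU

print(donate_rule("I want to donate"))        # -> asks what they want to donate
print(donate_rule("I want to donate blood"))  # -> asks them to log in
```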
4. Sentiment Analysis
Many cloud services provide ready-made APIs to analyze the sentiment of a "document". In our case, a document will be just 2-3 lines of a user's utterance, whereas these services rely on a document being large (like an article) to accurately predict its sentiment. Also, the definition of sentiment used by such services does not map directly to what we are looking for: they concentrate more on whether the sentiment is positive or negative and not on the criticality expressed.
The first step in building sentiment analysis for our case is to tag the phrases and words that express criticality in our text corpus. We will then need to analyse these tags to see what parts of speech these critical words occur in. Another part of speech we would normally ignore is the punctuation at the end of a sentence; repeated exclamation marks, for example, can signal criticality. Many voice-to-text cloud offerings support adding punctuation to the converted text, so our call center data is also covered. Words and phrases tagged as critical, their parts of speech and neighbouring words, along with non-critical words, their parts of speech and neighbouring words, will form the training corpus for sentiment analysis.
With the training data, we will have to pick features to classify a query as critical or not. Examples of such features include:
Number of repeated punctuation marks at the end of a query
TF-IDF weighting of words w.r.t. critical / non-critical tags
TF-IDF weighting of neighboring / context words etc…
The labeled data will be split into training and test sets using N-fold cross-validation to make sure the ML model is tested on data it hasn't seen before, while ensuring that a good mix of the data is utilized for training.
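A minimal sketch of such a classifier and its cross-validated evaluation, assuming scikit-learn; the labeled utterances are placeholders for the tagged corpus described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder labeled utterances: 1 = critical, 0 = not critical.
texts = [
    "my house is flooding and we are trapped!!",
    "please help, someone is injured",
    "the water is rising fast, we need rescue now!!",
    "my street is blocked and my child is missing",
    "where can i donate old clothes",
    "what are the shelter opening hours",
    "how do i sign up for volunteering",
    "can i get a receipt for my donation",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF over unigrams and bigrams; keeping '!' and '?' in the token pattern lets
# repeated end-of-utterance punctuation contribute to the features as well.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[\w!?]+", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

# N-fold cross-validation (here N=4) so the model is always scored on unseen data.
print("Accuracy per fold:", cross_val_score(model, texts, labels, cv=4))
```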
Conversation flows

IVR
FB messenger bot
ARC website bot
After we are able to understand a query and respond to it, we will have to build a conversation flow tailored to each channel. This is where we design a different conversation for each channel while having them share entities, intents etc… This step is also where we refer to the indexed text corpus to respond to queries; these responses will also vary by channel.
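One way to picture the shared-core / per-channel split; the names and response formats below are illustrative, not the final design.

```python
# Shared intent resolution produces channel-agnostic response data.
def resolve_intent(intent: str, slots: dict) -> dict:
    # In the real system this consults the indexed evergreen corpus and lookup APIs.
    return {"shelter_name": "Community Center", "address": "12 Main St"}

# Each channel renders the same data in its own conversation style.
def render_ivr(data: dict) -> str:
    return (f"The nearest open shelter is {data['shelter_name']} at {data['address']}. "
            "Press 1 to hear this again, or stay on the line to talk to a staff member.")

def render_messenger(data: dict) -> dict:
    return {"text": f"Nearest open shelter: {data['shelter_name']}, {data['address']}",
            "quick_replies": ["Directions", "Talk to a person"]}

def render_web(data: dict) -> dict:
    return {"html": f"<b>{data['shelter_name']}</b><br>{data['address']}",
            "buttons": ["Open in maps", "Chat with staff"]}

data = resolve_intent("ShelterLookup", {"location": "caller's zip"})
print(render_ivr(data))
print(render_messenger(data))
print(render_web(data))
```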
Implement custom integrations for the 3 channels
Every channel's custom conversation flow will need to interface with different APIs (3rd party or within ARC). In this step we will complete the conversation flow for each channel by integrating it with other services and fitting the chatbots into the rest of the ARC ecosystem.
ARC Website Bot UX

The web chatbot is the only channel that will need a custom UI to be built. Building a UI for such a bot, keeping in mind ARC’s current website architecture involves the following steps:
Research React packages that provide a clean chat UI
Build UI for the bot using a researched UI package
Ensure the UI follows the guidelines provided by ARC
Ensure UI is mobile responsive
Write iframe code to load the React UI from our backend and make it work with the rest of the HTML website
Write code to show a loading UI while the iframe loads
Building for the future

Support for Spanish (or any new language) will require us to revisit the steps we carried out for English. Following is an explanation of the steps and why we would have to revisit them:
Access voice or text data in the new language
Transcribe any voice data to text and evaluate the performance of this transcription
Annotate the new text corpus with the tags we came up with for the English corpus. It is unlikely that a tag used for the English corpus will be missing from the Spanish corpus, or vice versa; if such cases do come up, we will need to understand why and what the impact could be. For the purposes of this proposal, it is safe to assume that the annotation tags used for English will suffice for Spanish as well.
Build a new language model for the new language. It is important to understand the new language, so this step is unavoidable. But all the 3rd-party services we will be using to build the language model support Spanish as well, and our code that interfaces with these services will be written in a language-agnostic way so both (and any future languages) can share the same codebase.
Create entities, intents and slots for Spanish in DialogFlow (or any other cloud service). This again is a limitation of the cloud services: they need to be re-trained for the new language with new training phrases, intents etc… But as with English, this step will be driven by the language model, which makes it much easier to train for a new language while avoiding the common pitfalls encountered during training. We will also keep the same intents for both languages, with similar slots, so we can tell that a user is asking about shelter information whether they ask in English or Spanish.
All integrations with APIs outside the chatbot (whether inside or outside ARC) will be passed a language parameter with the current language of the user, so the 3rd-party API can respond in a language-specific way (sketched below).
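A small sketch of that pass-through; the endpoint and function names are hypothetical.

```python
import requests

def call_arc_api(endpoint: str, params: dict, language: str) -> dict:
    """Forward the user's current language so ARC and 3rd-party APIs can respond in it."""
    response = requests.get(endpoint, params={**params, "language": language}, timeout=10)
    response.raise_for_status()
    return response.json()

# Hypothetical usage (the endpoint is a placeholder, not a real ARC URL):
# shelters_es = call_arc_api("https://api.example.org/shelter-lookup", {"zip": "77001"}, "es")
```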
Disaster-specific vs. generic content
Right from the initial stages of the project, we will make an effort to differentiate between dynamic information that is disaster-specific (like shelter information) and static information that is evergreen. We will attempt to fill the dynamic information via APIs like Shelter Lookup, Local Chapter Lookup etc... Outside of such information, we will need to understand what information changes from disaster to disaster so we can scope out how to easily add and remove disaster-specific content.
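Illustratively, the response layer could branch on whether an intent's content is disaster-specific or evergreen; the intent names and endpoints below are placeholders.

```python
# Evergreen answers come from the tagged, indexed call center corpus.
EVERGREEN_RESPONSES = {
    "DonationInfo": "You can donate online, by phone, or by mail; would you like the details?",
}

# Disaster-specific, dynamic answers come from live lookups (hypothetical endpoints).
DYNAMIC_INTENTS = {
    "ShelterLookup": "https://api.example.org/shelter-lookup",
    "LocalChapterLookup": "https://api.example.org/chapter-lookup",
}

def answer(intent: str, slots: dict) -> dict:
    if intent in DYNAMIC_INTENTS:
        return {"source": "api", "endpoint": DYNAMIC_INTENTS[intent], "params": slots}
    return {"source": "evergreen", "response": EVERGREEN_RESPONSES.get(intent)}

print(answer("ShelterLookup", {"zip": "77001"}))
print(answer("DonationInfo", {}))
```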
Execution Plan

We anticipate that a team of two senior AI engineers and two senior full-stack engineers will be able to complete Phase 1 in about 12-15 weeks. If additional resources are needed to complete the project, Pirates will provide them without charging ARC any additional fees.
Personnel | Normal rate/hour | ARC rate/hour | ARC fees/week | ARC fees (12 weeks) |
---|---|---|---|---|
Sr. AI Engineer 1 | $150 | $125 | $5,000 | $60,000 |
Sr. AI Engineer 2 | $150 | $125 | $5,000 | $60,000 |
Sr. Full Stack Engineer 1 | $125 | $100 | $4,000 | $48,000 |
Sr. Full Stack Engineer 2 | $125 | $100 | $4,000 | $48,000 |
Designer | $100 | $80 | $3,200 | $9,600 (3 weeks) |
Project Manager | $125 | comped | comped | comped |
QA Engineer | $75 | comped | comped | comped |
Total | | | | $225,600 |