This is based on one of my previous analyses for a client. The client has an IT support department that receives telephone calls for support requests. Due to budget constraints the support team is small and has to provide onsite support in addition to the remote (telephone) support.

Team diagram

My client wants to know how many people should be available to pick up telephone calls in given periods (hour interval, day of the week, month) in order to properly assign the technicians to either remote support or onsite visits. Too many people picking up calls and there’s not enough work for everybody, while onsite support encounters delays. Not enough people picking up calls and the service quality suffers (possibly also incurring the penalties stipulated in the contract).

The questions that need answering are:

  1. What are the overall characteristics of the remote support telephone calls?
  2. How many technicians are required to pick up calls to fulfill the required service?

Due to data confidentiality, the call data presented here is not real; however, I tried as much as possible to preserve the distribution of calls across the day, the week and the periods of the year. In all other aspects, such as column data formats, the synthetic data used in this example is accurate.

In the real situation, the telephone call data came in XML format exported by an Alcatel-Lucent ACD and was imported by another Python script into a SQLite database. The database backed an in-house built telephone console used to visualize call activity in near real-time.

ACD - An automated call distribution system, commonly known as automatic call distributor (ACD), is a telephony device that answers and distributes incoming calls to a specific group of terminals or agents within an organization.

Data import and cleaning

First things first: the data needs to be imported from the SQLite database and loaded into a Pandas dataframe for analysis.

import sqlite3
from sqlite3 import Error
import pandas as pd
import numpy as np

def create_connection(path):
    # Open a connection to the SQLite database at the given path.
    connection = None
    try:
        connection = sqlite3.connect(path)
        print("Connection to SQLite DB successful")
    except Error as e:
        print(f"The error '{e}' occurred")

    return connection

def execute_read_query(connection, query):
    # Run a SELECT query and return all fetched rows (None on error).
    cursor = connection.cursor()
    result = None
    try:
        cursor.execute(query)
        result = cursor.fetchall()
        return result
    except Error as e:
        print(f"The error '{e}' occurred")

connection = create_connection("./call_journal.db")

Connection to SQLite DB successful

Data filtering sometimes starts with the SQL query. In my case, I’m only selecting the columns that I judged relevant to the analysis. However, it’s good to keep in mind that the left-out columns may contain relevant or even crucial information; in some cases it’s worth revisiting the initial DB to dig out more of it.

select_call_records = '''
    SELECT TicketType,
        ChargedUserID,
        SubscriberName,
        CommunicationType,
        Date,
        Time,
        DateTime,
        CallDuration,
        CallDurationMV,
        RingingDuration

    FROM call_journal'''

db_cols = ["TicketType",
        "ChargedUserID",
        "SubscriberName",
        "CommunicationType",
        "Date",
        "Time",
        "DateTime",
        "CallDuration",
        "CallDurationMV",
        "RingingDuration"]

call_records = execute_read_query(connection, select_call_records)

df = pd.DataFrame(call_records, columns = db_cols)
display(df.sample(5))
TicketType ChargedUserID SubscriberName CommunicationType Date Time DateTime CallDuration CallDurationMV RingingDuration
42299 Call 15 POSTE 6 IncomingTransfer 2020-05-18 14:25:00 2020-05-18 14:25:00 00:05:27 0:00:00 00:00:00
39961 Call 44 POSTE 8 Outgoing 2020-03-11 08:39:00 2020-03-11 08:39:00 00:00:05 0:00:00 00:00:00
45878 Call 33 MV 2 IncomingTransfer 2020-06-15 13:36:00 2020-06-15 13:36:00 0:00:00 00:00:04 00:00:00
36741 Call 19 ACD Incoming 2020-02-21 14:04:00 2020-02-21 14:04:00 0:00:00 00:00:10 00:00:00
9003 Call 4 POSTE 1 IncomingTransfer 2018-11-21 11:41:00 2018-11-21 11:41:00 0:03:16 0:00:00 0:00:06
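As an aside, pandas can collapse the query-and-wrap steps into a single call with pd.read_sql_query, inferring the column names from the SELECT statement (a sketch equivalent to the two steps above):

# One-step alternative (sketch): no manual cursor handling or column list needed.
df = pd.read_sql_query(select_call_records, connection)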
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63670 entries, 0 to 63669
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   TicketType         63670 non-null  object
 1   ChargedUserID      63670 non-null  object
 2   SubscriberName     63588 non-null  object
 3   CommunicationType  63670 non-null  object
 4   Date               63670 non-null  object
 5   Time               63670 non-null  object
 6   DateTime           63670 non-null  object
 7   CallDuration       63670 non-null  object
 8   CallDurationMV     63670 non-null  object
 9   RingingDuration    63670 non-null  object
dtypes: object(10)
memory usage: 4.9+ MB

The important thing after data import is to try to understand what data the columns actually hold. Not being aware of the purpose of the data, or of how it was generated, could lead to inaccurate results. I’ll give some examples below.
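A quick, throwaway way to get a first feel for each column (a profiling sketch, not part of the original pipeline):

# Throwaway profiling sketch: distinct count plus a few example values per column.
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values, e.g. {df[col].dropna().unique()[:3]}")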

The documentation of the data source (in my case, the ACD) was the first step in understanding how it generates the data. After going through (way too many) documentation pages, I found the following:

  1. calls are registered as tickets in the ACD (tickets are single units of information about certain operations, like calls, agent activity etc.)
  2. calls are received first by SubscriberName ACD, then transferred to SubscriberName agents (POSTE 1, POSTE 2, MV etc.) for call pickup or sent to voicemail. This is a very important piece of information: the ACD registers one ticket when the call reaches the system and another when it’s transferred and answered by an agent. Basically, the real number of calls is actually HALF the number of tickets. Example below
    SubscriberName CommunicationType DateTime DialledNumber
    ACD Incoming 2019-11-28 10:38:00 0750.123.456
    POSTE 1 IncomingTransfer 2019-11-28 10:38:00 0750.123.456

First, the call reaches the ACD, then the same call gets transferred to an agent (POSTE 1).

Call -- received --> ACD -- transferred --> POSTE 1

Since we are only interested in analyzing unique calls and not how they are routed, we need to drop all rows with SubscriberName == ACD to get the real numbers.

  3. The missed calls are not marked specifically. The call routing logic is that a call is either answered by someone or redirected to voicemail.
    SubscriberName CommunicationType DateTime DialledNumber
    ACD Incoming 2019-11-28 16:27:00 0750.123.456
    MV 1 IncomingTransfer 2019-11-28 16:28:00 0750.123.456

The call reaches the ACD, no one can pick up the call, and the call is redirected to voicemail MV 1.

Call -- received --> ACD -- transferred --> MV 1

Again, the same story as before: for the same call, 2 tickets are created. We’ll drop the rows with SubscriberName == "ACD".

df = df.loc[df["SubscriberName"] != "ACD"]

Now, we’ll continue looking into the other columns.

df["TicketType"].unique()

array(['Call'], dtype=object)

Since “TicketType” doesn’t hold any value other than “Call”, it carries no information for us, so we’ll drop it.

df.drop(labels = ["TicketType"], axis = 1, inplace = True)
df["ChargedUserID"].unique()

array(['13', '14', '10', '33', '32', '44', '11', '15', '12', '59', '30', '42', '01', '04', '03', '05', '02', '06', '07', '08', '43'], dtype=object)

df["SubscriberName"].unique()

array(['POSTE 4', 'POSTE 5', 'POSTE 1', 'MV 2', 'MV 1', 'POSTE 8', 'POSTE 2', 'POSTE 6', 'POSTE 3', 'POSTE 7'], dtype=object)

df["CommunicationType"].unique()

array(['IncomingTransfer', 'Outgoing', 'OutgoingTransfer', 'OutgoingTransferTransit', 'IncomingTransferTransit'], dtype=object)

This will be filtered as well, as we’re only interested in incoming calls. One curious thing is the “IncomingTransferTransit” value. We’ll check it out:

df.loc[df["CommunicationType"] == 'IncomingTransferTransit']
ChargedUserID SubscriberName CommunicationType Date Time DateTime CallDuration CallDurationMV RingingDuration
2329 14 POSTE 5 IncomingTransferTransit 2019-11-05 13:12:00 2019-11-05 13:12:00 00:00:09 0:00:00 00:00:00

Unfortunately, I could not find anything relevant in the documentation and, since it’s a single record, we’ll drop it.

df = df.loc[df["CommunicationType"] != 'IncomingTransferTransit']

One last thing to check is calls on Saturday and Sunday. During an initial run of this report, the data contained calls received on Saturdays and Sundays, so we filter them out.

df["Date"] = pd.to_datetime(df["Date"], format = "%Y-%m-%d")
df["Day_nb"] = df["Date"].dt.dayofweek
df = df.loc[~df["Day_nb"].isin([5,6])]
df.drop(labels = "Day_nb", axis = 1, inplace = True)

As a last check, we’ll look at the non-null counts per column, to see whether we need to dig deeper into the data.

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44890 entries, 0 to 63668
Data columns (total 9 columns):
  #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
  0   ChargedUserID      44890 non-null  object        
  1   SubscriberName     44890 non-null  object        
  2   CommunicationType  44890 non-null  object        
  3   Date               44890 non-null  datetime64[ns]
  4   Time               44890 non-null  object        
  5   DateTime           44890 non-null  object        
  6   CallDuration       44890 non-null  object        
  7   CallDurationMV     44890 non-null  object        
  8   RingingDuration    44890 non-null  object        
dtypes: datetime64[ns](1), object(8)
memory usage: 3.4+ MB

All looks OK so far, so we’ll proceed to setting the column data types.

Column data format

Next, we’ll set the proper type for certain columns. As a bonus, converting the low-cardinality string columns to category also trims memory usage (compare the memory usage figures of the two info() outputs).

df["ChargedUserID"] = df["ChargedUserID"].astype("category")
df["SubscriberName"] = df["SubscriberName"].astype("category")
df["CommunicationType"] = df["CommunicationType"].astype("category")

df["Time"] = pd.to_datetime(df["Date"], format = "%H:%M:%S")
df["DateTime"] = pd.to_datetime(df["DateTime"], format = "%Y-%m-%d %H:%M:%S")

df["CallDuration"] = pd.to_timedelta(df["CallDuration"])
df["CallDurationMV"] = pd.to_timedelta(df["CallDurationMV"])
df["RingingDuration"] = pd.to_timedelta(df["RingingDuration"])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44890 entries, 0 to 63668
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype          
---  ------             --------------  -----          
 0   ChargedUserID      44890 non-null  category       
 1   SubscriberName     44890 non-null  category       
 2   CommunicationType  44890 non-null  category       
 3   Date               44890 non-null  datetime64[ns] 
 4   Time               44890 non-null  datetime64[ns] 
 5   DateTime           44890 non-null  datetime64[ns] 
 6   CallDuration       44890 non-null  timedelta64[ns]
 7   CallDurationMV     44890 non-null  timedelta64[ns]
 8   RingingDuration    44890 non-null  timedelta64[ns]
dtypes: category(3), datetime64[ns](3), timedelta64[ns](3)
memory usage: 2.5 MB

Data Analysis

We’ll start with some basic data exploration, to see some trends. Since we’re only interested in incoming calls, we’ll create a mask and filter the data

import altair as alt

mask_incoming = (df["CommunicationType"] == "IncomingTransfer")
df = df.loc[mask_incoming]

# A call is answered if an agent (POSTE *) picked it up; voicemail (MV *) means it was missed.
df["Answered"] = 0
df.loc[df["SubscriberName"].str.startswith('POSTE'), "Answered"] = 1
df.sample(5)
ChargedUserID SubscriberName CommunicationType Date Time DateTime CallDuration CallDurationMV RingingDuration Answered
59243 33 MV 2 IncomingTransfer 2020-09-08 2020-09-08 2020-09-08 09:00:00 0 days 00:00:00 0 days 00:00:28 0 days 00:00:00 0
14136 5 POSTE 6 IncomingTransfer 2019-03-06 2019-03-06 2019-03-06 10:37:00 0 days 00:03:44 0 days 00:00:00 0 days 00:00:06 1
45790 33 MV 2 IncomingTransfer 2020-06-15 2020-06-15 2020-06-15 11:18:00 0 days 00:00:00 0 days 00:00:03 0 days 00:00:00 0
57021 32 MV 1 IncomingTransfer 2020-08-26 2020-08-26 2020-08-26 08:45:00 0 days 00:00:00 0 days 00:00:05 0 days 00:00:00 0
830 13 POSTE 4 IncomingTransfer 2019-07-25 2019-07-25 2019-07-25 08:49:00 0 days 00:03:47 0 days 00:00:00 0 days 00:00:00 1

Total calls per day

total_per_day = df.groupby("Date")["CommunicationType"].count().reset_index().rename({"CommunicationType": "Total"}, axis = 1)
total_per_day.sample(5)
Date Total
130 2019-04-18 47
409 2020-08-05 53
195 2019-07-26 66
329 2020-02-18 139
337 2020-02-28 53
alt.Chart(total_per_day).mark_trail().encode(
    x = "Date:T",
    y = 'Total',
    size = 'Total:Q'
).properties(
    width = 600
)

Total calls per day - chart

We can see that there is a huge chunk of data missing in April and May 2020. The gap is due to the COVID confinement.

Also, the chart is overcrowded; nothing is really visible here. Let’s remake it as a boxplot to show more detail.

boxplot = df.groupby("Date")["CommunicationType"].count().reset_index().rename({"CommunicationType": "Total"}, axis = 1)
boxplot["YearMonth"] = boxplot['Date'].dt.to_period('M').dt.strftime('%Y-%m')
boxplot.sample(5)
Date Total YearMonth
122 2019-04-08 75 2019-04
216 2019-08-28 68 2019-08
260 2019-10-29 135 2019-10
82 2019-02-11 85 2019-02
29 2018-11-23 47 2018-11
alt.Chart(boxplot).mark_boxplot().encode(
    x='YearMonth:O',
    y='Total:Q'
).properties(
    width=600
)

Total calls per month - boxplot

We can extract some info now:

  • the daily average held between 60 and 100 calls in 2019; however, the average amount of calls increased in 2020.
  • there is considerable variation in min / max calls from month to month (probably due to seasonal variation), and this will require more precise manpower estimates.
  • there are outliers on the max side, most likely indicating general infrastructure incidents that generate call peaks to the service. This should be addressed by an emergency service redirection.

Total calls per month - volume

total_per_month = df.groupby("Date")["CommunicationType"].count().reset_index().rename({"CommunicationType": "Total"}, axis = 1)
total_per_month.sample(5)
Date Total
130 2019-04-18 47
86 2019-02-15 49
262 2019-10-31 131
286 2019-12-17 68
302 2020-01-10 3
alt.Chart(total_per_month).mark_bar().encode(
    x = "yearmonth(Date):O",
    y = 'sum(Total):Q'
).properties(
    width = 600
)

Total calls per month - chart

total_heatmap = df.groupby("Date")["CommunicationType"].count().reset_index().rename({"CommunicationType": "Total"}, axis = 1)
heatmap_per_day = alt.Chart(total_heatmap).mark_rect().encode(
    x='date(Date):O',
    y='yearmonth(Date):O',
    color= alt.Color('Total:Q', scale=alt.Scale(scheme='goldred'))
).properties(
    width=600
    )
heatmap_per_day

Total calls per month - heatmap chart

heatmap_per_day_of_week = alt.Chart(total_heatmap).mark_rect().encode(
    x='day(Date):O',
    y='yearmonth(Date):O',
    color= alt.Color('Total:Q', scale=alt.Scale(scheme='goldred'))
).properties(
    width=600
    )
heatmap_per_day_of_week

Total calls per month and day of week - heatmap chart

Estimator

Given a required level of service, what is the minimum number of agents needed to fulfill it?

We’ll check this by constructing time intervals from each call’s start and end time, then checking at a certain frequency how many of those intervals overlap. Since one person can only answer one call at a time, the number of simultaneously overlapping intervals is the number of required agents.

Let’s look at a simple example:

Call diagram

In the interval 8h00 - 8h30, we received 2 calls that were answered and we missed one call. In order to properly estimate the time occupation, we add 5 min (the average conversation time) to missed calls. So, for this interval, we needed 3 agents. In the next interval, 8h30 - 9h00, we still have 2 agents on call (Call 1 and Call 2), we assume that a fictitious agent is handling missed Call 3, and we miss yet another call. So for this interval we would need 4 agents. Overall, we can see that we need:

Interval Agents needed
8h00 - 8h30 3
8h30 - 9h00 4
9h00 - 9h30 2

In the actual computation, the interval used to check call status will be smaller (5 min) in order to make the results more precise.
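Before running this on the full dataset, here is a minimal sketch of the overlap test we’ll rely on, pandas’ pd.Interval.overlaps, applied to the toy example above (the timestamps are made up to match the diagram; missed calls get the 5-minute average padding):

import pandas as pd

# Made-up calls matching the example: two answered calls and two missed ones
# (each missed call padded with the 5-minute average conversation time).
calls = [
    pd.Interval(pd.Timestamp("2020-01-06 08:05"), pd.Timestamp("2020-01-06 08:50")),  # Call 1, answered
    pd.Interval(pd.Timestamp("2020-01-06 08:20"), pd.Timestamp("2020-01-06 09:10")),  # Call 2, answered
    pd.Interval(pd.Timestamp("2020-01-06 08:28"), pd.Timestamp("2020-01-06 08:33")),  # Call 3, missed + 5 min
    pd.Interval(pd.Timestamp("2020-01-06 08:40"), pd.Timestamp("2020-01-06 08:45")),  # Call 4, missed + 5 min
]

# Count how many calls overlap each half-hour check interval.
for start in ("2020-01-06 08:00", "2020-01-06 08:30"):
    slot = pd.Interval(pd.Timestamp(start), pd.Timestamp(start) + pd.Timedelta("30 minutes"))
    print(slot, sum(slot.overlaps(call) for call in calls))  # -> 3 agents, then 4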

df["call_start_dt"] = df["DateTime"]

df["call_end_dt"] = np.nan
df.loc[df["Answered"] == 1, "call_end_dt"] = df["DateTime"] + df["CallDuration"]

average_call_time = df.loc[df["Answered"] == 1,"CallDuration"].mean()
average_call_time = average_call_time.floor("s")
average_call_time
Timedelta('0 days 00:05:11')
df.loc[df["Answered"] == 0, "call_end_dt"] = df["DateTime"] + pd.to_timedelta(average_call_time, unit = "S")
df.loc[:,("DateTime", "CallDuration", "call_start_dt", "call_end_dt")].sample(5)

def create_interval(row):
    # Build a pandas Interval spanning the call, from start to end.
    row["Time_interval"] = pd.Interval(row["call_start_dt"], row["call_end_dt"])
    return row

df = df.apply(create_interval, axis = 1)
df.loc[:, ["call_start_dt", "call_end_dt", "Time_interval"]].head()
call_start_dt call_end_dt Time_interval
0 2019-12-20 14:03:00 2019-12-20 14:12:43 (2019-12-20 14:03:00, 2019-12-20 14:12:43]
3 2019-12-20 14:27:00 2019-12-20 14:29:14 (2019-12-20 14:27:00, 2019-12-20 14:29:14]
4 2019-12-20 14:28:00 2019-12-20 14:30:07 (2019-12-20 14:28:00, 2019-12-20 14:30:07]
6 2019-12-20 14:35:00 2019-12-20 14:36:10 (2019-12-20 14:35:00, 2019-12-20 14:36:10]
10 2019-12-20 14:40:00 2019-12-20 14:41:45 (2019-12-20 14:40:00, 2019-12-20 14:41:45]

Constructing time intervals between date A and date B, where each day has opening hour X and closing hour Y, is … tricky :(

unique_date_list = df["Date"].unique()
date_range_start = []
date_range_end = []
frequency = pd.Timedelta("5 minutes")

# For every day that has calls, build 5-minute bins covering the 07:30 - 18:30 service window.
for unique_date in unique_date_list:
    day_range_start = pd.date_range(start = unique_date + pd.Timedelta("7 hours 30 minutes"),
                             end = unique_date + pd.Timedelta("18 hours 30 minutes") - frequency,
                             freq = frequency
                           )
    day_range_end = pd.date_range(start = unique_date + pd.Timedelta("7 hours 30 minutes") + frequency,
                             end = unique_date + pd.Timedelta("18 hours 30 minutes"),
                             freq = frequency
                           )
    
    date_range_start += day_range_start
    date_range_end += day_range_end
    
verification_range = pd.IntervalIndex.from_arrays(date_range_start, date_range_end)
verification_range
IntervalIndex([(2019-12-20 07:30:00, 2019-12-20 07:35:00], (2019-12-20 07:35:00, 2019-12-20 07:40:00], (2019-12-20 07:40:00, 2019-12-20 07:45:00], (2019-12-20 07:45:00, 2019-12-20 07:50:00], (2019-12-20 07:50:00, 2019-12-20 07:55:00] ... (2020-10-01 18:05:00, 2020-10-01 18:10:00], (2020-10-01 18:10:00, 2020-10-01 18:15:00], (2020-10-01 18:15:00, 2020-10-01 18:20:00], (2020-10-01 18:20:00, 2020-10-01 18:25:00], (2020-10-01 18:25:00, 2020-10-01 18:30:00]],
              closed='right',
              dtype='interval[datetime64[ns]]')
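As a side note, pd.interval_range can build the same per-day bins with less manual bookkeeping (a sketch; its default closed='right' matches the intervals we just built):

# Sketch: one IntervalIndex per day, appended into a single index.
daily_ranges = [
    pd.interval_range(start = pd.Timestamp(d) + pd.Timedelta("7 hours 30 minutes"),
                      end = pd.Timestamp(d) + pd.Timedelta("18 hours 30 minutes"),
                      freq = frequency)
    for d in unique_date_list
]
verification_range = daily_ranges[0].append(daily_ranges[1:])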
result = []
for verification_interval in verification_range:
    counter = 0
    
    # we're filtering the main dataframe as there is no point in searching for overlaps in the next day
    mask_filter_left = (df["call_start_dt"] > verification_interval.left - pd.Timedelta("24h"))
    mask_filter_right = (df["call_end_dt"] < verification_interval.right + pd.Timedelta("24h"))
    
    call_intervals = df.loc[(mask_filter_left & mask_filter_right), "Time_interval"]
    
    for call_interval in call_intervals:
        if verification_interval.overlaps(call_interval):
            counter += 1      
    result.append({"time_interval": verification_interval,
                  "total_overlaps": counter})
estimator = pd.DataFrame(result)
estimator.sample(5)
time_interval total_overlaps
13575 (2019-01-09 16:45:00, 2019-01-09 16:50:00] 2
58858 (2020-09-24 17:20:00, 2020-09-24 17:25:00] 1
28128 (2019-06-21 08:30:00, 2019-06-21 08:35:00] 1
52824 (2020-07-23 09:30:00, 2020-07-23 09:35:00] 1
24577 (2019-05-10 09:35:00, 2019-05-10 09:40:00] 3
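The double Python loop gets the job done, but on a larger dataset a vectorized variant may be worth considering: build a pd.IntervalIndex from the call intervals once and let IntervalIndex.overlaps test all calls against a slot in one go (a sketch, not what was used in the original report):

# Vectorized sketch: overlaps() returns a boolean array over all calls at once.
call_index = pd.IntervalIndex.from_arrays(df["call_start_dt"], df["call_end_dt"])

result = [{"time_interval": vi, "total_overlaps": int(call_index.overlaps(vi).sum())}
          for vi in verification_range]
estimator = pd.DataFrame(result)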

We got our results, but we’ll only keep the starting timestamp of each interval; it’s much easier to construct graphs from plain timestamps than from time intervals.

estimator["DateTime"] = estimator["time_interval"].apply(lambda x: x.left)
estimator = estimator.loc[:,("total_overlaps","DateTime")].rename({"total_overlaps": "Total"}, axis = 1)
estimator.sample(5)
Total DateTime
34381 1 2019-09-04 12:35:00
46296 2 2020-05-11 15:30:00
33707 2 2019-08-28 11:25:00
38821 3 2019-10-22 08:35:00
38847 1 2019-10-22 10:45:00

Average / Min / Max number of agents required per hour per day

estimator["Hour"] = estimator["DateTime"].dt.hour
estimator["Day_nb"] = estimator["DateTime"].dt.day
estimator.sample(5)
Total DateTime Hour Day_nb
16620 0 2019-02-11 17:30:00 17 11
35779 0 2019-09-19 08:05:00 8 19
41340 6 2020-01-27 09:30:00 9 27
14365 1 2019-01-17 16:35:00 16 17
18406 1 2019-03-01 12:20:00 12 1
def calc_agg(df):
    # Round the mean up: you can't staff a fraction of an agent.
    df["Total_mean"] = np.ceil(df["Total"].mean())
    df["Total_min"] = df["Total"].min()
    df["Total_max"] = df["Total"].max()

    return df

heatmap_table = estimator.groupby(["Day_nb", "Hour"]).apply(calc_agg).reset_index(drop = True)
heatmap_table
Total DateTime Hour Day_nb Total_mean Total_min Total_max
639 0 2019-12-27 16:45:00 16 27 2.0 0 5
33497 1 2019-08-26 15:55:00 15 26 2.0 0 10
46958 0 2020-05-18 15:40:00 15 18 2.0 0 6
43107 1 2020-02-13 13:45:00 13 13 1.0 0 7
56781 3 2020-09-03 09:15:00 9 3 3.0 0 9
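The groupby().apply() pattern above works, but the same aggregates can also be computed with the more idiomatic groupby().transform() (a sketch producing an equivalent table):

# transform() broadcasts each group's aggregate back onto the original rows.
g = estimator.groupby(["Day_nb", "Hour"])["Total"]
estimator["Total_mean"] = np.ceil(g.transform("mean"))
estimator["Total_min"] = g.transform("min")
estimator["Total_max"] = g.transform("max")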
alt.data_transformers.disable_max_rows()

base = alt.Chart(heatmap_table).encode(
    x='Hour:O',
    y='Day_nb:O'
    )

color_avg = base.mark_rect().encode(
    color= alt.Color('Total_mean:Q', scale=alt.Scale(scheme='goldred'))
    )

text_avg = base.mark_text(baseline='middle').encode(
    text = 'max(Total_mean):Q'
    )

heatmap = color_avg + text_avg
heatmap

Agents needed - heatmap chart