Part 3: Modelling Ops queues
Today, I would like to show you an example of the discrete-event simulation approach. We will model the Customer Support team and decide what strategy to use to improve its performance. But first, let me share a bit of my personal story.
I first learned about discrete simulations at university. One of my subjects was Queueing theory, and to get a final grade for it, I had to implement the airport simulation and calculate some KPIs. Unfortunately, I missed all the seminars because I was already working full-time, so I had no idea about the theory behind this topic and how to approach it.
I was determined to get an excellent mark, so I found a book, read it, understood the basics and spent a couple of evenings on implementation. It was pretty challenging since I hadn’t been coding for some time, but I figured it out and got my A grade.
At this point (as often happens with students), I had a feeling that this information wouldn’t be helpful for my future work. However, later, I realised that many analytical tasks can be solved with this approach. So, I would like to share it with you.
One of the most apparent use cases for agent-based simulations is Operational analytics. Most products have customer support where clients can get help. A CS team often looks at such metrics as:
- average resolution time: how much time passes between the customer reaching out to CS and getting the first answer,
- size of the queue that shows how many tasks we have in a backlog right now.
Without a proper model, it may be tricky to understand how our changes (e.g. introducing night shifts or just increasing the number of agents) will affect the KPIs. Simulations will help us do it.
So, let’s not waste our time and move on.
Basics of simulations and modelling
Let’s start from the very beginning. We will be modelling the system. The system is a collection of entities (for example, people, servers or even mechanical tools) that interact with each other to achieve some logical goal (e.g. answering a customer question or passing border control in an airport).
You could define the system with the needed granularity level, depending on your research goal. For example, in our case, we would like to investigate how the changes to agents’ efficiency and schedules could affect average CS ticket resolution time. So, the system will be just a set of agents. However, if we would like to model the possibility of outsourcing some tickets to different outsourcing companies, we will need to include these partners in our model.
The system is described by a set of variables — for example, the number of tickets in a queue or the number of agents working at the moment in time. These variables define the system state.
There are two types of systems:
- discrete — when the system state changes instantaneously, for example, the new ticket has been added to a queue or an agent has finished their shift.
- continuous — when the system is constantly evolving. One such example is a flying plane, in which coordinates, velocity, height, and other parameters change all the time during flight.
For our task, we can treat the system as discrete and use the discrete-event simulation approach. It’s a case when the system can change at only a countable number of points in time. These time points are where events occur and instantly change the system state.
So, the whole approach is based on events. We will generate and process events one by one to simulate how the system works. We can use the concept of a timeline to structure events.
Since this process is dynamic, we need to keep track of the current value of simulated time and be able to advance it from one value to another. The variable in a simulation model that shows the current time is often called the simulation clock.
We also need a mechanism to advance simulated time. There are two approaches to advance time:
- next-event time advance: we move from one event timestamp to the next one,
- fixed-increment time advance: we select a period, for example, 1 minute, and advance the clock by this period at each step.
I think the first approach is easier to understand, implement and debug. So, I will stick to it for this article.
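To make the difference between the two approaches concrete, here is a small sketch that counts how many times the clock has to move in each case. The function names, the 9:00-18:00 working day and the three event times are assumptions made for this illustration only.

```python
import datetime

def fixed_increment_steps(start, end, step):
    """Number of clock moves when advancing from start to end in fixed steps."""
    steps = 0
    clock = start
    while clock < end:
        clock += step
        steps += 1
    return steps

def next_event_steps(event_times):
    """With next-event time advance, the clock moves once per distinct event time."""
    return len(set(event_times))

day = datetime.datetime(2024, 5, 1)
events = [day.replace(hour=9, minute=15),
          day.replace(hour=9, minute=30),
          day.replace(hour=9, minute=45)]

fixed_steps = fixed_increment_steps(day.replace(hour=9), day.replace(hour=18),
                                    datetime.timedelta(minutes=1))
event_steps = next_event_steps(events)
print(fixed_steps)  # 540 one-minute steps, most of them with nothing to process
print(event_steps)  # 3 steps, one per event
```

With fixed increments, most steps find nothing to do, which is one more reason the next-event approach tends to be simpler to reason about.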
Let’s review a simple example to understand how it works. We will discuss a simplified case of the CS tickets queue.
We start the simulation, initialising the simulation clock. Sometimes, people use zero as the initial value. I prefer to use real-life data and the actual date times.
Here’s the initial state of our system. We have two events on our timeline related to two incoming customer requests.
The next step is to advance the simulation clock to the first event on our timeline — the customer request at 9:15.
It’s time to process this event. We should find an agent to work on this request, assign the request to them, and generate an event to finish the task. Events are the main drivers of our simulation, so it’s okay if one event creates another one.
Looking at the updated timeline, we can see that the most imminent event is not the second customer request but the completion of the first task.
So, we need to advance our clock to 9:30 and process the next event. The completion of the request won’t create new events, so after that, we will move to the second customer request.
We will repeat this process of moving from one event to another until the end of the simulation.
To avoid never-ending processes, we need to define the stopping criteria. In this case, we can use the following logic: if no more events are on the timeline, we should stop the simulation. In this simplified example, our simulation will stop after finishing the second task.
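The walk-through above can be sketched in a few lines. This is only a toy illustration, not the implementation built later in this article: the tuple-based events, the `run_toy_simulation` name, the single agent, the fixed 15-minute handling time and the 9:45 arrival of the second request are all assumptions made for the example.

```python
import datetime
import heapq

def run_toy_simulation(request_times, handling_time):
    """Next-event time advance for a single agent (illustrative only)."""
    # the timeline is a heap of (time, sequence number, event type)
    timeline = []
    for seq, time in enumerate(request_times):
        heapq.heappush(timeline, (time, seq, 'new_request'))
    seq = len(timeline)

    agent_busy = False
    waiting = 0   # number of requests waiting in the queue
    log = []

    # stopping criterion: no more events on the timeline
    while timeline:
        # advance the simulation clock to the most imminent event
        clock, _, event_type = heapq.heappop(timeline)
        log.append((clock, event_type))

        if event_type == 'new_request':
            if agent_busy:
                waiting += 1
            else:
                # one event creates another one: the completion of the task
                agent_busy = True
                heapq.heappush(timeline, (clock + handling_time, seq, 'finish_task'))
                seq += 1
        elif event_type == 'finish_task':
            if waiting > 0:
                waiting -= 1
                heapq.heappush(timeline, (clock + handling_time, seq, 'finish_task'))
                seq += 1
            else:
                agent_busy = False
    return log

day = datetime.datetime(2024, 5, 1)
log = run_toy_simulation(
    [day.replace(hour=9, minute=15), day.replace(hour=9, minute=45)],
    datetime.timedelta(minutes=15),
)
for time, event_type in log:
    print(time.strftime('%H:%M'), event_type)
# 09:15 new_request
# 09:30 finish_task
# 09:45 new_request
# 10:00 finish_task
```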
We’ve discussed the theory of discrete event simulations and understood how it works. Now, it’s time to practice and implement this approach in code.
The program architecture
Object-oriented programming
In my day-to-day job, I usually use a procedural programming paradigm. I create functions for some repetitive tasks, but other than that, my code is quite linear. It’s a pretty standard approach for data-wrangling tasks.
In this example, we will use Object-Oriented Programming (OOP). So, let’s spend some time revising this topic if you haven’t used classes in Python before or need a refresher.
OOP is based on the concept of objects. Objects consist of data (some features that are called attributes) and actions (functions or methods). The whole program describes the interactions between different objects. For example, if we have an object representing a CS agent, it can have the following properties:
- attributes: name, date when an agent started working, average time they spend on tasks or current status ("out of office", "working on task" or "free").
- methods: return the name, update the status or start processing a customer request.
To represent such an object, we can use Python classes. Let’s write a simple class for a CS agent.
class CSAgent:
    # initialising class
    def __init__(self, name, average_handling_time):
        # saving parameters mentioned during object creation
        self.name = name
        self.average_handling_time = average_handling_time
        # specifying constant value
        self.role = 'CS agent'
        print('Created %s with name %s' % (self.role, self.name))

    def get_name(self):
        return self.name

    def get_handling_time(self):
        return self.average_handling_time

    def update_handling_time(self, average_handling_time):
        print('Updating time from %.2f to %.2f' % (self.average_handling_time,
            average_handling_time))
        self.average_handling_time = average_handling_time
This class defines each agent’s name, average handling time, and role. I’ve also added a couple of functions that return internal variables, following the encapsulation pattern. Also, we have the update_handling_time function that allows us to update the agent’s performance.
We’ve created a class (a blueprint that describes any CS agent). Let’s make an instance of it: the agent John Doe.
john_agent = CSAgent('John Doe', 12.3)
# Created CS agent with name John Doe
When we created an instance of the class, the function __init__ was executed. We can use the __dict__ property to present class fields as a dictionary. It can often be handy, for example, if you want to convert a list of objects into a data frame.
print(john_agent.__dict__)
# {'name': 'John Doe', 'average_handling_time': 12.3, 'role': 'CS agent'}
We can try to execute a method and update the agent’s performance.
john_agent.update_handling_time(5.4)
# Updating time from 12.30 to 5.40

print(john_agent.get_handling_time())
# 5.4
One of the fundamental concepts of OOP that we will use today is inheritance. Inheritance allows us to have a high-level ancestor class and use its features in the descendant classes. Imagine we want to have not only CS agents but also KYC agents. We can create a high-level Agent class with common functionality and define it only once for both KYC and CS agents.
class Agent:
    # initialising class
    def __init__(self, name, average_handling_time, role):
        # saving parameters mentioned during object creation
        self.name = name
        self.average_handling_time = average_handling_time
        self.role = role
        print('Created %s with name %s' % (self.role, self.name))

    def get_name(self):
        return self.name

    def get_handling_time(self):
        return self.average_handling_time

    def update_handling_time(self, average_handling_time):
        print('Updating time from %.2f to %.2f' % (self.average_handling_time,
            average_handling_time))
        self.average_handling_time = average_handling_time
Now, we can create separate classes for these agent types and define slightly different __init__ and get_job_description functions.
class KYCAgent(Agent):
    def __init__(self, name, average_handling_time):
        super().__init__(name, average_handling_time, 'KYC agent')

    def get_job_description(self):
        return 'KYC (Know Your Customer) agents help to verify documents'

class CSAgent(Agent):
    def __init__(self, name, average_handling_time):
        super().__init__(name, average_handling_time, 'CS agent')

    def get_job_description(self):
        return 'CS (Customer Support) answer customer questions and help resolving their problems'
To specify inheritance, we mentioned the base class in brackets after the current class name. With super(), we can call the base class methods, for example, __init__ to create an object with a custom role value.
Let’s create objects and check whether they work as expected.
marie_agent = KYCAgent('Marie', 25)
max_agent = CSAgent('Max', 10)

print(marie_agent.__dict__)
# {'name': 'Marie', 'average_handling_time': 25, 'role': 'KYC agent'}
print(max_agent.__dict__)
# {'name': 'Max', 'average_handling_time': 10, 'role': 'CS agent'}
Let’s update Marie’s handling time. Even though we haven’t implemented this function in the KYCAgent class, it uses the implementation from the base class and works quite well.
marie_agent.update_handling_time(22.5)
# Updating time from 25.00 to 22.50
We can also call the methods we defined in the classes.
print(marie_agent.get_job_description())
# KYC (Know Your Customer) agents help to verify documents

print(max_agent.get_job_description())
# CS (Customer Support) answer customer questions and help resolving their problems
So, we’ve covered the basics of the object-oriented paradigm and Python classes. I hope it was a helpful refresher.
Now, it’s time to return to our task and the model we need for our simulation.
Architecture: classes
If you haven’t used OOP a lot before, switching your mindset from procedures to objects might be challenging. It takes some time to make this mindset shift.
One life hack is to use real-world analogies (e.g. it’s pretty clear that an agent is an object with some features and actions).
Also, don’t be afraid to make a mistake. There are better and worse program architectures: some will be easier to read and support over time. However, there are a lot of debates about the best practices, even among mature software engineers, so I wouldn’t worry too much about making it perfect for ad-hoc analytical research.
Let’s think about what objects we need in our simulation:
- System: the most high-level concept we have in our task. The system will represent the current state and execute the simulation.
- As we discussed before, the system is a collection of entities. So, the next object we need is Agent. This class will describe agents working on tasks.
- Each agent will have its schedule: hours when this agent is working, so I’ve isolated it into a separate class Schedule.
- Our agents will be working on customer requests. So, it’s a no-brainer: we need to represent them in our system. Also, we will store a list of processed requests in the System object to get the final stats after the simulation.
- If no free agent picks up a new customer request, it will be put into a queue. So, we will have a RequestQueue as an object to store all customer requests with the FIFO logic (First In, First Out).
- The following important concept is TimeLine that represents the set of events we need to process ordered by time.
- TimeLine will include events, so we will also create a class Event for them. Since we will have a bunch of different event types that we need to process differently, we can leverage the OOP inheritance. We will discuss event types in more detail in the next section.
That’s it. I’ve put all the classes and links between them into a diagram to clarify it. I use such charts to have a high-level view of the system before starting the implementation — it helps to think about the architecture early on.
As you might have noticed, the diagram is not super detailed. For example, it doesn’t include all field names and methods. It’s intentional. This schema will be used as a helicopter view to guide the development. So, I don’t want to spend too much time writing down all the field and method names because these details might change during the implementation.
Architecture: event types
We’ve covered the program architecture, and now it’s time to think about the main drivers of our simulation — events.
Let’s discuss what events we need to generate to keep our system working.
- The event I will start with is the “Agent Ready” event. It shows that an agent starts their work and is ready to pick up a task (if we have any waiting in the queue).
- We need to know when agents start working. These working hours can depend on an agent and the day of the week. Potentially, we might even want to change the schedules during the simulation. It’s pretty challenging to create all “Agent Ready” events when we initialise the system (especially since we don’t know how much time we need to finish the simulation). So, I propose a recurrent “Plan Agents Schedule” event to create ready-to-work events for the next day.
- The other essential event we need is a “New Customer Request” — an event that shows that we got a new CS contact, and we need to either start working on it or put it in a queue.
- The last event is “Agent Finished Task”, which shows that the agent finished the task they were working on and is potentially ready to pick up a new task.
That’s it. These four events are enough to run the whole simulation.
Similar to classes, there are no right or wrong answers for system modelling. You might use a slightly different set of events. For example, you can add a “Start Task” event to have it explicitly.
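As a quick summary of the four event types, here is a sketch that lists them together with the events each one can put on the timeline. The string values match the identifiers used in the implementation below; the `EventType` enum and the `CREATES` mapping are hypothetical helpers introduced only for this illustration (the implementation itself passes plain strings).

```python
from enum import Enum

class EventType(Enum):
    PLAN_AGENTS_SCHEDULE = 'plan_agents_schedule'
    AGENT_READY = 'agent_ready_for_task'
    NEW_CUSTOMER_REQUEST = 'new_customer_request'
    AGENT_FINISHED_TASK = 'finish_handling_request'

# which new events the processing of each event type may create
CREATES = {
    # planning creates "Agent Ready" events for the day and the next planning event
    EventType.PLAN_AGENTS_SCHEDULE: [EventType.AGENT_READY, EventType.PLAN_AGENTS_SCHEDULE],
    # an agent who picks up a request schedules its completion
    EventType.AGENT_READY: [EventType.AGENT_FINISHED_TASK],
    EventType.NEW_CUSTOMER_REQUEST: [EventType.AGENT_FINISHED_TASK],
    # after finishing, the agent may pick up the next request from the queue
    EventType.AGENT_FINISHED_TASK: [EventType.AGENT_FINISHED_TASK],
}

for event_type, created in CREATES.items():
    print(event_type.value, '->', [e.value for e in created])
```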
Implementation
You can find the full implementation on GitHub.
We’ve defined the high-level structure of our solution, so we are ready to start implementing it. Let’s start with the heart of our simulation — the system class.
Initialising the system
Let’s start with the __init__ method for the system class.
First, let’s think about the parameters we would like to specify for the simulation:
- agents: set of agents that will be working in the CS team,
- queue: the current queue of customer requests (if we have any),
- initial_date: since we agreed to use the actual timestamps instead of relative ones, I will specify the date when we start simulations,
- logging: flag that defines whether we would like to print some info for debugging,
- customer_requests_df: data frame with information about the set of customer requests we would like to process.
Besides input parameters, we will also create the following internal fields:
- current_time: the simulation clock that we will initialise as 00:00:00 of the initial date specified,
- timeline: the object that we will use to define the order of events (stored as _timeline in the code),
- processed_requests: an empty list where we will store the processed customer requests to get the data after simulation.
It’s time to take the necessary actions to initialise a system. There are only two steps left:
- Plan agents work for the first day. I’ll generate and process a corresponding event with an initial timestamp.
- Load customer requests by adding corresponding “New Customer Request” events to the timeline.
Here’s the code that does all these actions to initialise the system.
import datetime

class System:
    def __init__(self, agents, queue, initial_date,
            customer_requests_df, logging = True):
        # the simulation clock starts at 00:00:00 of the initial date
        initial_time = datetime.datetime(initial_date.year, initial_date.month,
            initial_date.day, 0, 0, 0)
        self.agents = agents
        self.queue = RequestQueue(queue)
        self.logging = logging
        self.current_time = initial_time
        self._timeline = TimeLine()
        self.processed_requests = []

        # plan agents' work for the first day
        initial_event = PlanScheduleEvent('plan_agents_schedule', initial_time)
        initial_event.process(self)
        # put all customer requests on the timeline
        self.load_customer_request_events(customer_requests_df)
It’s not working yet since it has links to non-implemented classes and methods, but we will cover it all one by one.
Timeline
Let’s start with the classes we used in the system definition. The first one is TimeLine. The only field it has is the list of events. Also, it implements a bunch of methods:
- adding events (and ensuring that they are ordered chronologically),
- returning the next event and deleting it from the list,
- telling how many events are left.
class TimeLine:
    def __init__(self):
        self.events = []

    # the annotation is a string because the Event class is defined later
    def add_event(self, event: 'Event'):
        self.events.append(event)
        self.events.sort(key = lambda x: x.time)

    def get_next_item(self):
        if len(self.events) == 0:
            return None
        return self.events.pop(0)

    def get_remaining_events(self):
        return len(self.events)
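Sorting the whole list on every insert is perfectly fine for a simulation of this size. If the timeline ever grows large, one possible alternative is a binary heap from the standard library. The sketch below is an assumption of mine, not part of the original implementation: the `HeapTimeLine` name is hypothetical, and the events in the usage example are stand-in objects that only have the `type` and `time` fields.

```python
import datetime
import heapq
import itertools
from types import SimpleNamespace

class HeapTimeLine:
    """Same interface as TimeLine, backed by a binary heap (illustrative sketch)."""
    def __init__(self):
        self._heap = []
        # tie-breaker: events with equal timestamps come out in insertion order
        self._counter = itertools.count()

    def add_event(self, event):
        heapq.heappush(self._heap, (event.time, next(self._counter), event))

    def get_next_item(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def get_remaining_events(self):
        return len(self._heap)

# stand-in events with only the fields the timeline needs
timeline = HeapTimeLine()
timeline.add_event(SimpleNamespace(type='finish', time=datetime.datetime(2024, 5, 1, 9, 30)))
timeline.add_event(SimpleNamespace(type='request', time=datetime.datetime(2024, 5, 1, 9, 15)))

first_event = timeline.get_next_item()
print(first_event.type)                 # request
print(timeline.get_remaining_events())  # 1
```

The counter plays the same role as the stable sort in the list-based version: two events with the same timestamp keep the order in which they were added.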
Customer requests queue
The other class we used in initialisation is RequestQueue.
There are no surprises: the request queue consists of customer requests. Let’s start with this building block. We know each request’s creation time and how much time an agent will need to work on it.
class CustomerRequest:
    def __init__(self, id, handling_time_secs, creation_time):
        self.id = id
        self.handling_time_secs = handling_time_secs
        self.creation_time = creation_time

    def __str__(self):
        return f'Customer Request {self.id}: {self.creation_time.strftime("%Y-%m-%d %H:%M:%S")}'
It’s a simple data class that contains only parameters. The only new thing here is that I’ve overridden the __str__ method to change the output of a print function. It’s pretty handy for debugging. You can compare it yourself.
test_object = CustomerRequest(1, 600, datetime.datetime(2024, 5, 1, 9, 42, 1))

# without defining __str__
print(test_object)
# <__main__.CustomerRequest object at 0x280209130>

# with custom __str__
print(test_object)
# Customer Request 1: 2024-05-01 09:42:01
Now, we can move on to the requests queue. Similarly to the timeline, we’ve implemented methods to add new requests, count the requests in the queue and get the next request from the queue.
class RequestQueue:
    def __init__(self, queue = None):
        if queue is None:
            self.requests = []
        else:
            self.requests = queue

    def get_requests_in_queue(self):
        return len(self.requests)

    def add_request(self, request):
        self.requests.append(request)

    def get_next_item(self):
        if len(self.requests) == 0:
            return None
        return self.requests.pop(0)
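A list works well here, although removing the first element of a long list gets slower as the list grows. As a possible variation (not what the original implementation uses), the same FIFO logic can be sketched with `collections.deque`. The `DequeRequestQueue` name is hypothetical, and plain strings stand in for customer requests in the usage example.

```python
from collections import deque

class DequeRequestQueue:
    """FIFO queue with the same interface as RequestQueue (illustrative sketch)."""
    def __init__(self, queue = None):
        # note: unlike RequestQueue, this copies the initial items into a new container
        self.requests = deque(queue or [])

    def get_requests_in_queue(self):
        return len(self.requests)

    def add_request(self, request):
        self.requests.append(request)

    def get_next_item(self):
        if not self.requests:
            return None
        # First In, First Out: take the oldest request
        return self.requests.popleft()

queue = DequeRequestQueue(['request 1'])
queue.add_request('request 2')
queue.add_request('request 3')

first = queue.get_next_item()
print(first)                          # request 1
print(queue.get_requests_in_queue())  # 2
```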
Agents
The other thing we need to initialise the system is agents. First, each agent has a schedule — a period when they are working depending on a weekday.
class Schedule:
    def __init__(self, time_periods):
        self.time_periods = time_periods

    def is_within_working_hours(self, dt):
        weekday = dt.strftime('%A')

        if weekday not in self.time_periods:
            return False

        hour = dt.hour
        time_periods = self.time_periods[weekday]
        for period in time_periods:
            if (hour >= period[0]) and (hour < period[1]):
                return True
        return False
The only method we have for a schedule checks whether the agent is working at the specified moment.
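Here is a quick check of how the schedule behaves around period boundaries. The class is repeated in a compact form so that the snippet runs on its own, and the example assumes the default locale, where `%A` returns English weekday names. The dates are picked for illustration: 8 April 2024 is a Monday.

```python
import datetime

class Schedule:
    # compact copy of the class above, repeated so the example is self-contained
    def __init__(self, time_periods):
        self.time_periods = time_periods

    def is_within_working_hours(self, dt):
        weekday = dt.strftime('%A')
        if weekday not in self.time_periods:
            return False
        # periods are (start hour, end hour): start is included, end is not
        return any(start <= dt.hour < end
                   for start, end in self.time_periods[weekday])

schedule = Schedule({'Monday': [(9, 12), (13, 18)]})

checks = {
    'Monday 09:00': schedule.is_within_working_hours(datetime.datetime(2024, 4, 8, 9, 0)),
    'Monday 12:30': schedule.is_within_working_hours(datetime.datetime(2024, 4, 8, 12, 30)),
    'Monday 17:59': schedule.is_within_working_hours(datetime.datetime(2024, 4, 8, 17, 59)),
    'Monday 18:00': schedule.is_within_working_hours(datetime.datetime(2024, 4, 8, 18, 0)),
    'Tuesday 10:00': schedule.is_within_working_hours(datetime.datetime(2024, 4, 9, 10, 0)),
}
for moment, is_working in checks.items():
    print(moment, is_working)
# Monday 09:00 True
# Monday 12:30 False
# Monday 17:59 True
# Monday 18:00 False
# Tuesday 10:00 False
```

Since only the hour is compared, schedules in this model have one-hour granularity.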
Let’s define the agent class. Each agent will have the following attributes:
- id and name: primarily for logging and debugging purposes,
- schedule: the agent’s schedule object we’ve just defined,
- request_in_work: link to the customer request object that shows whether an agent is occupied right now or not,
- effectiveness: the coefficient that shows how efficient the agent is compared to the expected time to solve the particular task (the expected handling time is multiplied by it, so values below 1 mean a faster agent).
We have the following methods implemented for agents:
- understanding whether they can take on a new task (whether they are free and still working),
- start and finish processing the customer request.
class Agent:
    def __init__(self, id, name, schedule, effectiveness = 1):
        self.id = id
        self.schedule = schedule
        self.name = name
        self.request_in_work = None
        self.effectiveness = effectiveness

    def is_ready_for_task(self, dt):
        if (self.request_in_work is None) and (self.schedule.is_within_working_hours(dt)):
            return True
        return False

    def start_task(self, customer_request):
        self.request_in_work = customer_request
        customer_request.handling_time_secs = int(round(self.effectiveness * customer_request.handling_time_secs))

    def finish_task(self):
        self.request_in_work = None
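To see what the effectiveness coefficient does, here is the scaling from start_task isolated into a tiny function. The `effective_handling_time` name and the numbers are made up for the illustration.

```python
def effective_handling_time(handling_time_secs, effectiveness = 1):
    """Mirrors the scaling applied in Agent.start_task."""
    return int(round(effectiveness * handling_time_secs))

# a request that is expected to take 10 minutes
print(effective_handling_time(600))        # 600: an average agent
print(effective_handling_time(600, 0.8))   # 480: a faster agent
print(effective_handling_time(600, 1.25))  # 750: a slower agent
```

Note that start_task overwrites handling_time_secs on the request itself, so the processed requests will contain the actual handling time of the assigned agent rather than the initial estimate.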
Loading initial customer requests to the timeline
The only thing we are missing from the system __init__ function (besides the events processing that we will discuss in detail a bit later) is the load_customer_request_events function implementation. It’s pretty straightforward. We just need to add it to our System class.
class System:
    # (continuation of the System class: add this method next to __init__)
    def load_customer_request_events(self, df):
        # filter requests before the start of simulation
        filt_df = df[df.creation_time >= self.current_time]
        if filt_df.shape[0] != df.shape[0]:
            if self.logging:
                print('Attention: %d requests have been filtered out since they are outdated' % (df.shape[0] - filt_df.shape[0]))

        # create new customer request events for each record
        for rec in filt_df.sort_values('creation_time').to_dict('records'):
            customer_request = CustomerRequest(rec['id'], rec['handling_time_secs'],
                rec['creation_time'])
            self.add_event(NewCustomerRequestEvent(
                'new_customer_request', rec['creation_time'],
                customer_request
            ))
Cool, we’ve figured out the primary classes. So, let’s move on to the implementation of the events.
Processing events
As discussed, I will use the inheritance approach and create an Event class. For now, it implements only __init__ and __str__ functions, but potentially, it can help us provide additional functionality for all events.
class Event:
    def __init__(self, event_type, time):
        self.type = event_type
        self.time = time

    def __str__(self):
        if self.type == 'agent_ready_for_task':
            return '%s (%s) - %s' % (self.type, self.agent.name, self.time)
        return '%s - %s' % (self.type, self.time)
Then, I implement a separate subclass for each event type, since each might have a slightly different initialisation. For example, for the AgentReady event, we also have an Agent object. More than that, each Event class implements a process method that takes system as an input.
class AgentReadyEvent(Event):
    def __init__(self, event_type, time, agent):
        super().__init__(event_type, time)
        self.agent = agent

    def process(self, system: System):
        # get next request from the queue
        next_customer_request = system.queue.get_next_item()

        # start processing request if we had some
        if next_customer_request is not None:
            self.agent.start_task(next_customer_request)
            next_customer_request.start_time = system.current_time
            next_customer_request.agent_name = self.agent.name
            next_customer_request.agent_id = self.agent.id

            if system.logging:
                print('<%s> Agent %s started to work on request %d' % (system.current_time,
                    self.agent.name, next_customer_request.id))

            # schedule finish processing event
            system.add_event(FinishCustomerRequestEvent('finish_handling_request',
                system.current_time + datetime.timedelta(seconds = next_customer_request.handling_time_secs),
                next_customer_request, self.agent))

class PlanScheduleEvent(Event):
    def __init__(self, event_type, time):
        super().__init__(event_type, time)

    def process(self, system: System):
        if system.logging:
            print('<%s> Scheduled agents for today' % (system.current_time))

        current_weekday = system.current_time.strftime('%A')

        # create agent ready events for all agents working on this weekday
        for agent in system.agents:
            if current_weekday not in agent.schedule.time_periods:
                continue

            for time_periods in agent.schedule.time_periods[current_weekday]:
                system.add_event(AgentReadyEvent('agent_ready_for_task',
                    datetime.datetime(system.current_time.year, system.current_time.month,
                        system.current_time.day, time_periods[0], 0, 0),
                    agent))

        # schedule next planning
        system.add_event(PlanScheduleEvent('plan_agents_schedule', system.current_time + datetime.timedelta(days = 1)))

class FinishCustomerRequestEvent(Event):
    def __init__(self, event_type, time, customer_request, agent):
        super().__init__(event_type, time)
        self.customer_request = customer_request
        self.agent = agent

    def process(self, system):
        self.agent.finish_task()
        # log finish time
        self.customer_request.finish_time = system.current_time
        # save processed request
        system.processed_requests.append(self.customer_request)

        if system.logging:
            print('<%s> Agent %s finished request %d' % (system.current_time, self.agent.name, self.customer_request.id))

        # pick up the next request if the agent continues working and we have something in the queue
        if self.agent.is_ready_for_task(system.current_time):
            next_customer_request = system.queue.get_next_item()
            if next_customer_request is not None:
                self.agent.start_task(next_customer_request)
                next_customer_request.start_time = system.current_time
                next_customer_request.agent_name = self.agent.name
                next_customer_request.agent_id = self.agent.id

                if system.logging:
                    print('<%s> Agent %s started to work on request %d' % (system.current_time,
                        self.agent.name, next_customer_request.id))

                system.add_event(FinishCustomerRequestEvent('finish_handling_request',
                    system.current_time + datetime.timedelta(seconds = next_customer_request.handling_time_secs),
                    next_customer_request, self.agent))

class NewCustomerRequestEvent(Event):
    def __init__(self, event_type, time, customer_request):
        super().__init__(event_type, time)
        self.customer_request = customer_request

    def process(self, system: System):
        # check whether we have a free agent
        assigned_agent = system.get_free_agent(self.customer_request)

        # if not, put the request in a queue
        if assigned_agent is None:
            system.queue.add_request(self.customer_request)

            if system.logging:
                print('<%s> Request %d put in a queue' % (system.current_time, self.customer_request.id))
        # if yes, start processing it
        else:
            assigned_agent.start_task(self.customer_request)
            self.customer_request.start_time = system.current_time
            self.customer_request.agent_name = assigned_agent.name
            self.customer_request.agent_id = assigned_agent.id

            if system.logging:
                print('<%s> Agent %s started to work on request %d' % (system.current_time, assigned_agent.name, self.customer_request.id))

            system.add_event(FinishCustomerRequestEvent('finish_handling_request',
                system.current_time + datetime.timedelta(seconds = self.customer_request.handling_time_secs),
                self.customer_request, assigned_agent))
That’s actually it with the events processing business logic. The only bit we need to finish is to put everything together to run our simulation.
Putting all together in the system class
As we discussed, the System class will be in charge of running the simulations. So, we will put the remaining nuts and bolts there.
Here’s the remaining code. Let me briefly walk you through the main points:
- is_simulation_finished defines the stopping criteria for our simulation: no requests are in the queue, and no events are in the timeline.
- process_next_event gets the next event from the timeline and executes process for it. There’s a slight nuance here: we might end up in a situation where our simulation never ends because of recurring “Plan Agents Schedule” events. That’s why, in case of processing such an event type, I check whether there are any other events in the timeline and if not, I don’t process it since we don’t need to schedule agents anymore.
- run_simulation is the function that rules our world, but since we have quite a decent architecture, it’s a couple of lines: we check whether we can finish the simulation, and if not, we process the next event.
class System:
    # (continuation of the System class: add these methods next to __init__)
    # defines the stopping criteria
    def is_simulation_finished(self):
        if self.queue.get_requests_in_queue() > 0:
            return False
        if self._timeline.get_remaining_events() > 0:
            return False
        return True

    # wrappers for timeline methods to encapsulate this logic
    def add_event(self, event):
        self._timeline.add_event(event)

    def get_next_event(self):
        return self._timeline.get_next_item()

    # returns free agent if we have one (and None otherwise)
    def get_free_agent(self, customer_request):
        for agent in self.agents:
            if agent.is_ready_for_task(self.current_time):
                return agent
        return None

    # finds and processes the next event
    def process_next_event(self):
        event = self.get_next_event()
        if self.logging:
            print('# Processing event: ' + str(event))
        if (event.type == 'plan_agents_schedule') and self.is_simulation_finished():
            if self.logging:
                print("FINISH")
        else:
            self.current_time = event.time
            event.process(self)

    # main function
    def run_simulation(self):
        while not self.is_simulation_finished():
            self.process_next_event()
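The nuance with the recurring planning event is easier to see in isolation. Below is a stripped-down sketch with only two event kinds represented as tuples; the `run_until_done` name and the data are assumptions for this example, and unlike the System class it looks only at the timeline and ignores the request queue.

```python
import datetime

def run_until_done(start, request_times):
    """Processes events in time order; the daily 'plan' event re-creates itself."""
    timeline = sorted([(start, 'plan')] + [(t, 'request') for t in request_times])
    processed = []
    while timeline:
        time, event_type = timeline.pop(0)
        if event_type == 'plan':
            if not timeline:
                # only the recurring event was left: stop instead of planning one more day
                break
            timeline.append((time + datetime.timedelta(days = 1), 'plan'))
            timeline.sort()
        processed.append((time, event_type))
    return processed

start = datetime.datetime(2024, 4, 8)
processed = run_until_done(start, [
    start + datetime.timedelta(hours = 10),
    start + datetime.timedelta(days = 1, hours = 11),
])
print([event_type for _, event_type in processed])
# ['plan', 'request', 'plan', 'request']
```

Without the check, each planning event would add the next one, and the timeline would never become empty.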
It was a long journey, but we’ve done it. Amazing job! Now, we have all the logic we need. Let’s move on to the fun part and use our model for analysis.
Analysis
I will use a synthetic Customer Requests dataset to simulate different Ops setups.
First of all, let’s run our system and look at metrics. I will start with 15 agents who are working regular hours.
# initialising agents
regular_work_week = Schedule(
    {
        'Monday': [(9, 12), (13, 18)],
        'Tuesday': [(9, 12), (13, 18)],
        'Wednesday': [(9, 12), (13, 18)],
        'Thursday': [(9, 12), (13, 18)],
        'Friday': [(9, 12), (13, 18)]
    }
)

agents = []
for id in range(15):
    agents.append(Agent(id + 1, 'Agent %s' % id, regular_work_week))

# initial date
system_initial_date = datetime.date(2024, 4, 8)

# initialising the system
# backlog_df is the data frame with customer requests
# (columns: id, handling_time_secs, creation_time)
system = System(agents, [], system_initial_date, backlog_df, logging = False)

# running the simulation
system.run_simulation()
As a result of the execution, we got all the stats in system.processed_requests. Let’s put together a couple of helper functions to analyse the results more easily.
# convert results to data frame and calculate timings
def get_processed_results(system):
processed_requests_df = pd.DataFrame(list(map(lambda x: x.__dict__, system.processed_requests)))
processed_requests_df = processed_requests_df.sort_values('creation_time')
processed_requests_df['creation_time_hour'] = processed_requests_df.creation_time.map(
lambda x: x.strftime('%Y-%m-%d %H:00:00')
)processed_requests_df['resolution_time_secs'] = list(map(
lambda x, y: int(x.strftime('%s')) - int(y.strftime('%s')),
processed_requests_df.finish_time,
processed_requests_df.creation_time
))
processed_requests_df['waiting_time_secs'] = processed_requests_df.resolution_time_secs - processed_requests_df.handling_time_secs
processed_requests_df['waiting_time_mins'] = processed_requests_df['waiting_time_secs']/60
processed_requests_df['handling_time_mins'] = processed_requests_df.handling_time_secs/60
processed_requests_df['resolution_time_mins'] = processed_requests_df.resolution_time_secs/60
return processed_requests_df
# calculating queue size with 5 mins granularity
def get_queue_stats(processed_requests_df):
    queue_stats = []
    current_time = datetime.datetime(system_initial_date.year, system_initial_date.month, system_initial_date.day, 0, 0, 0)
    while current_time <= processed_requests_df.creation_time.max() + datetime.timedelta(seconds = 300):
        # a request is in the queue if it was created but not yet picked up
        queue_size = processed_requests_df[(processed_requests_df.creation_time <= current_time) & (processed_requests_df.start_time > current_time)].shape[0]
        queue_stats.append(
            {
                'time': current_time,
                'queue_size': queue_size
            }
        )
        current_time = current_time + datetime.timedelta(seconds = 300)
    return pd.DataFrame(queue_stats)
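To see what the filter inside get_queue_stats counts, here is a minimal, self-contained sketch on three made-up requests (the timestamps are purely illustrative and not taken from the backlog). A request is in the queue at a given moment if it has already been created but no agent has started working on it yet.

```python
import datetime
import pandas as pd

# three hypothetical requests: creation time and the time an agent picked them up
toy_df = pd.DataFrame({
    'creation_time': pd.to_datetime(['2024-04-08 09:00', '2024-04-08 09:02', '2024-04-08 09:04']),
    'start_time': pd.to_datetime(['2024-04-08 09:01', '2024-04-08 09:10', '2024-04-08 09:12']),
})

def queue_size_at(df, moment):
    # created at or before `moment`, but not yet started
    return df[(df.creation_time <= moment) & (df.start_time > moment)].shape[0]

# queue size at 09:00, 09:05, 09:11 and 09:15
toy_queue_sizes = {
    minute: queue_size_at(toy_df, datetime.datetime(2024, 4, 8, 9, minute))
    for minute in (0, 5, 11, 15)
}
print(toy_queue_sizes)
```

At 09:05, for example, the second and third requests have been created but neither has been started, so the queue size is two.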
Also, let’s make a couple of charts and calculate weekly metrics.
def analyse_results(system, show_charts = True):
    processed_requests_df = get_processed_results(system)
    queue_stats_df = get_queue_stats(processed_requests_df)

    # hourly stats for the charts
    stats_df = processed_requests_df.groupby('creation_time_hour').aggregate(
        {'id': 'count', 'handling_time_mins': 'mean', 'resolution_time_mins': 'mean',
         'waiting_time_mins': 'mean'}
    )

    if show_charts:
        fig = px.line(stats_df[['id']],
            labels = {'value': 'requests', 'creation_time_hour': 'request creation time'},
            title = '<b>Number of requests created</b>')
        fig.update_layout(showlegend = False)
        fig.show()

        fig = px.line(stats_df[['waiting_time_mins', 'handling_time_mins', 'resolution_time_mins']],
            labels = {'value': 'time in mins', 'creation_time_hour': 'request creation time'},
            title = '<b>Resolution time</b>')
        fig.show()

        fig = px.line(queue_stats_df.set_index('time'),
            labels = {'value': 'number of requests in queue'},
            title = '<b>Queue size</b>')
        fig.update_layout(showlegend = False)
        fig.show()

    # weekly periods, labelled by the Monday of each week
    processed_requests_df['period'] = processed_requests_df.creation_time.map(
        lambda x: (x - datetime.timedelta(x.weekday())).strftime('%Y-%m-%d')
    )
    queue_stats_df['period'] = queue_stats_df['time'].map(
        lambda x: (x - datetime.timedelta(x.weekday())).strftime('%Y-%m-%d')
    )

    # parentheses are needed so the chained calls can span several lines
    period_stats_df = (
        processed_requests_df.groupby('period')
        .aggregate({'id': 'count', 'handling_time_mins': 'mean',
                    'waiting_time_mins': 'mean',
                    'resolution_time_mins': 'mean'})
        .join(queue_stats_df.groupby('period')[['queue_size']].mean())
    )

    return period_stats_df

# execution
analyse_results(system)
Now, we can use this function to analyse the simulation results. Apparently, 15 agents are not enough for our product since, after three weeks, we have 4K+ requests in a queue and an average resolution time of around ten days. Customers would be very unhappy with our service if we had just 15 agents.
Let’s find out how many agents we need to cope with the demand. We can run a series of simulations with different numbers of agents and compare the results.
tmp_dfs = []

for num_agents in tqdm.tqdm(range(15, 105, 5)):
    agents = []
    for id in range(num_agents):
        agents.append(Agent(id + 1, 'Agent %s' % id, regular_work_week))
    system = System(agents, [], system_initial_date, backlog_df, logging = False)
    system.run_simulation()

    tmp_df = analyse_results(system, show_charts = False)
    tmp_df['num_agents'] = num_agents
    tmp_dfs.append(tmp_df)
We can see that from ~25–30 agents onwards, metrics for different weeks are roughly the same, so there’s enough capacity to handle incoming requests and the queue is not growing week after week.
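The loop over agent counts leaves one weekly summary per scenario in tmp_dfs. One way to compare them side by side is to concatenate the frames and pivot them so that each row is an agent count and each column is a week. The sketch below uses made-up numbers as stand-ins for the real simulation output; fake_period_stats is a hypothetical helper that only mimics the shape of what analyse_results returns.

```python
import pandas as pd

# hypothetical stand-in for the frames returned by analyse_results
def fake_period_stats(num_agents, resolution_by_week):
    df = pd.DataFrame(
        {'resolution_time_mins': resolution_by_week},
        index=pd.Index(['2024-04-08', '2024-04-15', '2024-04-22'], name='period'),
    )
    df['num_agents'] = num_agents
    return df

toy_tmp_dfs = [
    fake_period_stats(15, [3000.0, 9000.0, 15000.0]),  # resolution time keeps growing
    fake_period_stats(30, [500.0, 510.0, 495.0]),      # stable week over week
]

# one row per agent count, one column per week
comparison_df = (
    pd.concat(toy_tmp_dfs)
    .reset_index()
    .pivot(index='num_agents', columns='period', values='resolution_time_mins')
)

# ratio of the last week to the first: values near 1 suggest enough capacity
comparison_df['last_to_first'] = comparison_df['2024-04-22'] / comparison_df['2024-04-08']
print(comparison_df)
```

A ratio well above 1 means the backlog is accumulating from week to week, while a ratio close to 1 indicates that the team keeps up with the incoming requests.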
If we model the situation with 30 agents, we can see that the queue is empty from 13:50 until the end of the working day from Tuesday to Friday. Agents spend Monday processing the large queue that builds up over the weekend.
With such a setup, the average resolution time is 500.67 minutes, and the average queue length is 259.39.
Let’s try to think about the possible improvements for our Operations team:
- we can hire another five agents,
- we can start leveraging LLMs and reduce handling time by 30%,
- we can shift agents’ schedules to provide coverage during weekends and late hours.
Since we now have a model, we can easily estimate all the opportunities and pick the most feasible one.
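For the LLM scenario, one simple way to express a 30% reduction is to scale the handling time of every request before feeding the backlog into the system. The sketch below works on a toy data frame. The column name handling_time_secs is an assumption borrowed from the processed requests and may differ in your backlog_df, and reduce_handling_time is a hypothetical helper.

```python
import pandas as pd

# toy backlog; the real backlog_df is assumed to have a similar handling-time column
toy_backlog_df = pd.DataFrame({
    'id': [1, 2, 3],
    'handling_time_secs': [600, 1200, 300],
})

def reduce_handling_time(backlog_df, reduction_share):
    # return a copy so the baseline scenario stays untouched
    scenario_df = backlog_df.copy()
    scenario_df['handling_time_secs'] = (
        scenario_df['handling_time_secs'] * (1 - reduction_share)
    ).round().astype(int)
    return scenario_df

llm_backlog_df = reduce_handling_time(toy_backlog_df, 0.3)
print(llm_backlog_df)
```

The resulting frame could then be passed to System in place of backlog_df, with the list of agents left unchanged. The scenario with five extra agents needs no data changes at all: it is the same run with 35 agents instead of 30.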
The first two approaches are straightforward. Let’s discuss how we can shift the agents’ schedules. All our agents are working from Monday to Friday from 9 to 18. Let’s try to make their coverage a little bit more equally distributed.
First, we can cover later and earlier hours, splitting agents into two groups. We will have agents working from 7 to 16 and from 11 to 20.
Second, we can split them across working days more evenly. I used quite a straightforward approach.
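The exact split is not spelled out here, so the sketch below is only one possible way to build such schedules and not necessarily the one behind the results that follow. It produces plain dictionaries in the same format as regular_work_week (a day name mapped to a list of (start, end) hour tuples), keeps the same pattern of three working hours, a one-hour break and five more working hours, alternates early and late shifts, and rotates the days off so that every day of the week is covered. All helper names are hypothetical.

```python
from collections import Counter

DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

def make_shift(start_hour):
    # same shape as the regular day: 3 hours of work, 1 hour break, 5 hours of work
    return [(start_hour, start_hour + 3), (start_hour + 4, start_hour + 9)]

def make_week(start_hour, first_day_index, working_days=5):
    # consecutive working days starting from first_day_index, wrapping around the week
    return {
        DAYS[(first_day_index + offset) % 7]: make_shift(start_hour)
        for offset in range(working_days)
    }

def build_schedules(num_agents):
    schedules = []
    for agent_index in range(num_agents):
        start_hour = 7 if agent_index % 2 == 0 else 11  # early or late shift
        schedules.append(make_week(start_hour, first_day_index=agent_index % 7))
    return schedules

schedules = build_schedules(30)

# how many agents are scheduled on each day of the week
coverage = Counter(day for week in schedules for day in week)
print(coverage)
```

Each dictionary could then be wrapped as Schedule(week) and passed to Agent in the same way as regular_work_week, assuming Schedule accepts Saturday and Sunday keys.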
In reality, you can go even further and allocate fewer agents on weekends since we have way less demand. It can improve your metrics even further. However, the additional effect will be marginal.
If we run simulations for all these scenarios, surprisingly, we will see that KPIs will be way better if we just change agents’ schedules. If we hire five more people or improve agents’ performance by 30%, we won’t achieve such a significant improvement.
Let’s see how changes in agents’ schedules affect our KPIs. Resolution time grows only for cases outside working hours (from 20 to 7), and queue size never reaches 200 cases.
That’s an excellent result. Our simulation model has helped us prioritise operational changes instead of hiring more people or investing in LLM tool development.
We’ve discussed the basics of this approach in this article. If you want to dig deeper and use it in practice, here are a couple more suggestions that might be useful:
- Before starting to use such models in production, it’s worth testing them. The most straightforward way is to model your current situation and compare the main KPIs. If they differ a lot, then your system doesn’t represent the real world well enough, and you need to make it more accurate before using it for decision-making.
- The current metrics are customer-focused. I’ve used average resolution time as the primary KPI to make decisions. In business, we also care about costs. So, it’s worth looking at this task from an operational perspective as well, e.g. measuring the percentage of time when agents don’t have tasks to work on (which means we are paying them for nothing).
- In real life, there might be spikes (e.g. the number of customer requests has doubled because of a bug in your product), so I recommend you use such models to ensure that your CS team can handle such situations.
- Last but not least, the model I’ve used was entirely deterministic (it returns the same result on every run), because handling time was defined for each customer request. To better understand metrics variability, you can specify the distribution of handling times (depending on the task type, day of the week, etc.) for each agent and get handling time from this distribution at each iteration. Then, you can run the simulation multiple times and calculate the confidence intervals of your metrics.
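As a rough illustration of the last point, the sketch below draws handling times from a log-normal distribution (an assumption, chosen only because handling times are positive and usually right-skewed), repeats a toy run many times and reports a percentile-based 95% interval for the KPI across runs. The function simulate_once is a hypothetical stand-in for a full run of the system, and the distribution parameters are made up.

```python
import numpy as np

def simulate_once(rng, num_requests=500):
    # hypothetical stand-in for a full simulation run:
    # sample a handling time (in minutes) per request and return the mean
    handling_time_mins = rng.lognormal(mean=2.0, sigma=0.5, size=num_requests)
    return handling_time_mins.mean()

rng = np.random.default_rng(42)  # fixed seed so the experiment is reproducible
kpi_samples = np.array([simulate_once(rng) for _ in range(200)])

# 95% interval from the empirical distribution of the KPI across runs
ci_lower, ci_upper = np.percentile(kpi_samples, [2.5, 97.5])
print('mean KPI: %.2f, 95%% interval: [%.2f, %.2f]' % (kpi_samples.mean(), ci_lower, ci_upper))
```

In the real model, each run would sample handling times inside the system and return resolution time or queue size, but the outer loop and the percentile calculation would stay the same.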
Summary
So, let’s briefly summarise the main points we’ve discussed today:
- We’ve learned the basics of the discrete-event simulation approach that helps to model discrete systems with a countable number of events.
- We’ve revisited object-oriented programming and classes in Python since this paradigm is more suitable for this task than the procedural code data analysts usually write.
- We’ve built the model of the CS team and were able to estimate the impact of different potential improvements on our KPIs (resolution time and queue size).
Thank you a lot for reading this article. If you have any follow-up questions or comments, please leave them in the comments section.
Reference
All the images are produced by the author unless otherwise stated.