How Zerodha reduced their backend processing time from hours to minutes using AWS Batch

Hemant Jain
Sep 27, 2022

Zerodha, India’s largest retail stockbroker by active client base, contributes upwards of 18% of daily retail volumes across Indian stock exchanges. One of Zerodha’s key business objectives in 2021 was to provide accurate profit and loss (PnL) calculations to its customers on a daily basis. However, this involved overcoming several challenges around data processing. This guest post covers how Zerodha scaled their backend processing using AWS Batch.

Backend Processing

Backend processes for stockbrokers include cash settlement, stock settlement, PnL calculation, buy average calculation, and generating and sending invoices. Stockbrokers receive data from the exchanges, consisting of transactions and metadata (tax calculation, margins, instrument data, etc.). These processes are CPU intensive because in certain cases, like PnL calculations, stockbrokers have to process not only current transactions but also historical transactions that may go back several years. All of them have to be completed before the next trading session commences. For Zerodha, the challenge multiplied as its customer base and data volume grew.

Zerodha ran most backend processes on commercial third-party software, which left very little room for optimization. The following figure shows a high-level view of the larger pipeline that includes the PnL step.

Figure 1. Workflow with PnL step

Profit and Loss (PnL) calculation

A PnL calculation requires data from the last trading session as well as older historical data. Zerodha processes more than 100 million transactional records daily, for an average of 3 million customers, and the data from these transactions typically grows to four times its original (input) size once the PnL processing is complete.

Zerodha’s Original PnL Process

The PnL calculation is part of a larger workflow that involves multiple steps that are beyond the scope of this blog post, as shown in Figure 1. These steps were executed before and after the PnL step and all of the steps in this workflow had to be completed before the next trading session. The following diagram shows Zerodha’s previous PnL calculation process.

Figure 2. PnL Calculation workflow

The original process was executed as follows.

  1. Data from the exchanges was pulled using Secure File Transfer Protocol (SFTP) to local disks on on-premises servers.
  2. A vendor system performed processing before the data was persisted. This processing comprised charge calculations, generation of contract notes, etc.
  3. The data from step 2 was then persisted in Zerodha’s databases. The PnL calculation was performed by software Zerodha developed in-house, and data was persisted in database clusters.

Zerodha had to complete all the necessary backend processing before the commencement of the next trading session. The PnL calculation workflow described in the steps above was time consuming and risked breaching Zerodha’s internal deadlines. It typically took nearly seven hours, at a time when Zerodha handled almost a tenth of the records it processes today. With data from exchanges typically unavailable until midnight and with the ever-increasing number of customers, Zerodha realized that this workflow had to be optimized to meet future demand.

Zerodha’s New PnL Process

The software Zerodha used to perform the PnL calculation was not designed to scale, so they decided to optimize this step. To keep up with ever-growing data volumes and tight internal schedules, they defined the following core requirements for the solution:

  1. The processing logic should support parallelism to ensure the processing is scalable with the data and customer growth.
  2. The solution should provide observability into the resource usage of their processes, specifically CPU. The PnL calculation is extremely compute intensive, and visibility into resource utilization would allow the process to be tuned later through code. A configuration that lets Zerodha scale their cluster out or in, as required, would also help them optimize costs on the processing cluster.
  3. The I/O during the PnL calculation process has to be extremely fast to meet the data persistence requirements.
  4. Orchestration of the jobs that comprise the PnL calculation should be managed by the solution vendor as this would save on development and operational efforts.

Zerodha chose AWS Batch because of its flexibility to leverage multiple job queues, mapped to multiple compute environments, to parallelize processing of input records. Because Batch uses Amazon Elastic Container Service (Amazon ECS) clusters under the hood, observing resource utilization was easy. Finally, AWS Batch, being a fully managed service, does not require the installation or management of batch computing software. Zerodha only had to containerize the business logic and configure the right set of network rules to get it working. A Redis cluster was also added to the new process for fast I/O, as shown in Figure 3.

Figure 3. PnL calculation with AWS Batch
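To make the idea of parallelizing input records across several job queues concrete, here is a minimal sketch in Python. It is not Zerodha's implementation: the chunk size, the queue names, the job naming scheme, and the functions chunked and plan_jobs are all hypothetical, and the round-robin assignment is just one simple way to spread chunks over queues.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def plan_jobs(customer_ids: Iterable[str], chunk_size: int,
              job_queues: list[str]) -> list[dict]:
    """Split customers into chunks and assign each chunk to a queue, round-robin.

    Queue names and the job naming scheme are illustrative assumptions.
    """
    if not job_queues:
        raise ValueError("at least one job queue is required")
    plan = []
    for index, chunk in enumerate(chunked(customer_ids, chunk_size)):
        plan.append({
            "job_name": f"pnl-chunk-{index:05d}",
            "job_queue": job_queues[index % len(job_queues)],
            "customer_ids": chunk,
        })
    return plan


if __name__ == "__main__":
    ids = [f"C{n:04d}" for n in range(10)]
    for job in plan_jobs(ids, chunk_size=4,
                         job_queues=["pnl-queue-a", "pnl-queue-b"]):
        print(job["job_name"], job["job_queue"], len(job["customer_ids"]))
```

Each entry in the plan corresponds to one independent unit of work, which is what lets the processing scale out as the number of customers grows.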

The following steps describe how the PnL step is executed with AWS Batch.

  1. Data from the exchanges is pulled using SFTP to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. An application running on Amazon Elastic Compute Cloud (Amazon EC2) instances performs processing before the data is persisted. This processing comprises charge calculations, generation of contract notes, etc.
  3. The processed data is persisted in PostgreSQL and MySQL databases running on Amazon EC2 instances.
  4. An AWS Batch job is then triggered to extract the preprocessed data from the databases and populate the data into a Redis cluster to ensure faster I/O during processing.
  5. The AWS Batch job then starts processing the records and persists the processed results back to the Redis cluster (shown as green arrows in Figure 3).
  6. An application running on an Amazon EC2 cluster then extracts the data from the Redis cluster and performs post-processing, such as removing unwanted data and adding customer-related data, before persisting the data back into Zerodha’s databases.
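Steps 4 and 5 can be sketched as a load stage and a process stage that share one fast key-value store. The sketch below is a toy under stated assumptions: a plain Python dict stands in for the Redis cluster, the key layout is invented, and the average-cost, long-only PnL rule is only a placeholder, since the post does not describe Zerodha's actual calculation logic. The Trade type and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trade:
    customer_id: str
    symbol: str
    side: str       # "BUY" or "SELL"
    quantity: int
    price: float


def realized_pnl(trades):
    """Average-cost realized PnL for one customer and symbol (long-only toy model)."""
    held, avg_cost, pnl = 0, 0.0, 0.0
    for trade in trades:
        if trade.quantity <= 0:
            raise ValueError("quantity must be positive")
        if trade.side == "BUY":
            total_cost = avg_cost * held + trade.price * trade.quantity
            held += trade.quantity
            avg_cost = total_cost / held
        elif trade.side == "SELL":
            if trade.quantity > held:
                raise ValueError("this sketch does not model short positions")
            pnl += (trade.price - avg_cost) * trade.quantity
            held -= trade.quantity
        else:
            raise ValueError(f"unknown side: {trade.side!r}")
    return pnl


def load_trades(store, trades):
    """Step 4: copy preprocessed trades into the fast store, grouped by customer and symbol."""
    for trade in trades:
        key = ("trades", trade.customer_id, trade.symbol)
        store.setdefault(key, []).append(trade)


def process_trades(store):
    """Step 5: compute PnL per group and write the results back to the same store."""
    trade_keys = [key for key in store if key[0] == "trades"]
    for _, customer_id, symbol in trade_keys:
        trades = store[("trades", customer_id, symbol)]
        store[("pnl", customer_id, symbol)] = realized_pnl(trades)


if __name__ == "__main__":
    store = {}  # stand-in for the Redis cluster; real Redis keys would be strings
    load_trades(store, [
        Trade("C1", "ABC", "BUY", 10, 100.0),
        Trade("C1", "ABC", "BUY", 10, 110.0),
        Trade("C1", "ABC", "SELL", 5, 120.0),
    ])
    process_trades(store)
    print(store[("pnl", "C1", "ABC")])
```

Keeping both the inputs and the results in the same in-memory store means the compute-heavy stage avoids database round trips, which is the role the Redis cluster plays in Figure 3.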

AWS Batch setup details

Through several experiments, Zerodha determined how large each submitted task should be for a given number of vCPUs and amount of memory. AWS Batch made every step after this easy: all they had to do was submit tasks, and the “Best Fit” strategy autoscaled the number of Amazon EC2 instances spawned from the pool of instance types defined in the compute environment configuration.
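As an illustration of submitting a task with a specific vCPU and memory size, the sketch below builds the request for the AWS Batch SubmitJob API as a plain dictionary. The field names (jobName, jobQueue, jobDefinition, containerOverrides, resourceRequirements) follow that API, but the job, queue, and definition names, the sizes, the PNL_CHUNK_KEY environment variable, and the helper function itself are hypothetical and not taken from Zerodha's setup.

```python
def build_submit_job_request(job_name, job_queue, job_definition,
                             vcpus, memory_mib, chunk_key):
    """Build keyword arguments for the AWS Batch SubmitJob API.

    Resource values are passed as strings, and memory is expressed in MiB.
    """
    if vcpus <= 0 or memory_mib <= 0:
        raise ValueError("vcpus and memory_mib must be positive")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
            # Hypothetical variable telling the container which chunk to process.
            "environment": [{"name": "PNL_CHUNK_KEY", "value": chunk_key}],
        },
    }


if __name__ == "__main__":
    request = build_submit_job_request(
        job_name="pnl-chunk-00000",
        job_queue="pnl-queue-a",
        job_definition="pnl-worker",
        vcpus=4,
        memory_mib=8192,
        chunk_key="chunk:00000",
    )
    print(request)
    # With boto3 installed and AWS credentials configured, the request could be
    # sent with: boto3.client("batch").submit_job(**request)
```

Building the request separately from sending it makes it easy to experiment with different task sizes before any job is actually submitted.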

One of the very first challenges that Zerodha faced with Batch was that it took a long time (around five minutes) to start spawning Amazon EC2 instances before the tasks would get invoked. They found a workaround for this by “pre-warming” the compute environment.

The compute environment could be configured with a task’s Docker image and the job queue could be tweaked on demand. They set the “Desired vCPUs” to a fixed number (for example, 100) that matched the number of CPUs of one of the instance types defined in the configuration. This resulted in the submitted tasks spawning Amazon EC2 instances almost instantly.
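One way to script this kind of pre-warming is through the AWS Batch UpdateComputeEnvironment API, which accepts a desiredvCpus value inside computeResources. The sketch below only builds the request. The environment name, the vCPU limits, and the helper function are assumptions for illustration, and the range check reflects the rule that the desired value has to lie between the environment's minimum and maximum vCPUs.

```python
def build_prewarm_request(compute_environment, desired_vcpus,
                          min_vcpus, max_vcpus):
    """Build keyword arguments for the AWS Batch UpdateComputeEnvironment API.

    min_vcpus and max_vcpus are the limits already configured on the compute
    environment; they are used here only to validate the desired value.
    """
    if not (min_vcpus <= desired_vcpus <= max_vcpus):
        raise ValueError(
            f"desired_vcpus must be between {min_vcpus} and {max_vcpus}"
        )
    return {
        "computeEnvironment": compute_environment,
        "computeResources": {"desiredvCpus": desired_vcpus},
    }


if __name__ == "__main__":
    request = build_prewarm_request(
        compute_environment="pnl-compute-env",  # hypothetical name
        desired_vcpus=100,                      # the example value from the text
        min_vcpus=0,
        max_vcpus=256,
    )
    print(request)
    # With boto3 installed and AWS credentials configured, the request could be
    # sent with: boto3.client("batch").update_compute_environment(**request)
```

Running such a request shortly before the nightly jobs are submitted would mean capacity is already being provisioned when the first task arrives.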

“Pre-warming” their compute environment therefore saved them another few minutes that would otherwise have added to the overall time required to finish all the calculations. Using AWS Batch, they were able to take advantage of a large compute environment on demand and prioritize their business-critical objective: efficiently showing customers the accurate PnL of their positions.

They were able to take their original seven-hour process and finish it in 20–30 minutes once they moved to Batch.

Conclusion

Zerodha was able to achieve one of its key business objectives of providing accurate PnL calculations to its customers on a daily basis. AWS Batch was a critical service in helping reduce the time it took to process PnL from seven hours to around 20–30 minutes.

AWS Batch enabled Zerodha to focus on optimizing their core business logic without having to worry about managing batch jobs, to scale with growing data processing requirements, and to observe resource utilization in their compute environments to optimize costs. Zerodha continues to wow its ever-increasing customer base and AWS is happy to be part of their journey.

To learn more about AWS Batch please visit its service page on the AWS website, and to learn more about Zerodha visit their website.

Hemant Jain

Sr. SRE at Oracle, Ex-PayPal, Ex-RedHat. Professional Graduate Student interested in Cloud Computing and Advanced Big Data Processing and Optimization.