Welcome to Part 2 of the Expediting Bulk Data Processing Blog series. Here we would be looking at certain smart ways of improving performance of Batch Job in updating/processing large volume of data.
As a Recap, Part 1 of this series looked at impact of Data Skew – Ownership on holding the performance of Bulk data processing and how we could mitigate it accordingly (I would encourage you to visit Part1 of this series if you haven’t already)
In this article, we will be looking at some techniques on improving the performance of existing Batch Job.
I am sure some of our Salesforce Admins/Developers would have observed Batch Job slowness while processing large volume of data. Triggers OR backend customizations could be a factor to slowness BUT what if we need those customizations to execute as part of data updates and still expect Batch Job to perform faster?
This is exactly the situation with Mike, who is a Salesforce Developer with System Administrator Profile in Universal Container. Lets look at his usecase in detail.
Universal Container has Account object which stores Account’s country. There are 3M Accounts in the system and the associated Contacts are almost 8M.
Mike has developed a Batch Job that will populate Region__c (A Custom Field) in Contact record based on the associated Account’s Country. The Country to Region mapping is defined in a custom Table for the Code to refer. Batch Job got deployed on a launch weekend and the Business expectation is to have the Region value populated for 8M Contacts within that weekend.
Prior to Launch weekend, Mike had tested the batch Job performance in a sandbox. The Batch job took some time to query data but eventually processed 100k in 2 hrs. That’s a decent enough performance, BUT if we factor in 8M… by math the overall process will then consume 6 plus days. For sure, Mike will not be able to meet business expectations and his situation is very bad now.
What approach will Mike take to tackle this situation and what action he took eventually?
Aren’t you curious? Lets cut to the chase.
Mike had a 3 phased approach to this situation -> Query Analysis || Data Analysis || Admin settings. Lets dive onto them.
- Query Analysis: Mike looked at the Query of batch job and spotted couple of things:
Field indexing: Though the Query has proper WHERE clause and date range defined, the custom fields in the WHERE clause were not indexed. Custom Indexing plays a major factor in improving the SOQL query performance and one needs to raise a support case with Salesforce to enable it. Please refer to Salesforce Help Article for more details ( https://help.salesforce.com/articleView?id=000325247&type=1&mode=1) . Another interesting article on indexing is below
Need of Skinny Table: Mike observed that the Batch job query had both standard fields and Custom fields present. Apart from Indexing, Skinny Table also plays a helpful hand in improving the Query performance. Skinny table is very useful when you have large volume of data to process and the query refers both custom fields and standard fields. A normal query that refers both type of custom and standard fields will need a JOIN at the backend database table. BUT the same query against a Skinny table will avoid that Join, thus improving query performance. This is because Skinny table would host all these fields in a single table. Please refer to Salesforce Knowledge Article for more details ( https://developer.salesforce.com/docs/atlas.en-us.salesforce_large_data_volumes_bp.meta/salesforce_large_data_volumes_bp/ldv_deployments_infrastructure_skinny_tables.htm)
A posit observation from Mike was that, he had the overall query stored in a custom label and had the label referred in the Batch Job. This is one of the best and recommended practices in Salesforce development. Mike then proceed to the 2 nd phase of the approach – Data Analysis.
2. Data Analysis: Mike checked on Organization wide default setting of the Contact Object and it was set as “Controlled by Parent” (Implies Contacts OWD is controlled by associated Account). Mike then proceeded to understand the distribution of 8M contacts and found below. This helped him to plan the execution of Jobs.
Admin Settings: Mike learnt about two capabilities that ended up helping his situation.
Defer Sharing Calculation: Defer Sharing feature of Salesforce helps to suspend custom sharing rule calculations initiated by Salesforce as part of record update. Considering the Org having huge volume of data, such a capability will help improve performance. Please help refer Salesforce help article for more details ( https://help.salesforce.com/articleView?id=security_sharing_rule_recalculation.htm&type=5)
User Pin : This capability is Salesforce Product/Support team controlled and can only be enabled by them for requests pertaining to Large Data Volumes. Salesforce does not entertain such requests frequently unless there is a business Urgency. Salesforce shall only pin minimum number users based on the duration of request and one has to contact Salesforce support for the same. To my understanding, when a user gets pinned by Salesforce.. all the operations handled by the user are considered as priority and assigned additional resources by Salesforce at the backend. As a result, performance of the Job run by a pinned user has a better performance.
With the above, lets look at the actions Mike took during Launch weekend
1. Mike raised SFDC Support case for below purposes and Salesforce did help considering the case severity.
i. Enabling Skinny Table for improving Query Performance.
ii. Indexing the fields that were present in WHERE clause of the query.
iii. Pinning the User that will be used for running the batch job.
2. As Mike had the Batch Job query in custom label, he adjusted the query to be region specific and maintained 4 queries accordingly.
3. Mike Enabled Defer Sharing after deployment was over.
4. Then ran the Batch Job that had the query referring APJ region, using the pinned user. Mike followed it up by executing the Batch Job again after 10 minutes interval from the first job. This time the job’s query was referring EMEA.
5. Mike repeated above step for subsequent Regions. In other words, Mike invoked the power of Parallelism in expediting Large Data Volume data processing. Parallelism is one of the recommended practices of Salesforce and can be used in situations where BULK API doesn’t fit. And guess what?
With the above actions, Mike was able to process 8M of contact within a day and successfully met business expectations. The overall performance improved 8X and Mike was elated 😊
With appropriate analysis and decision making, Mike had a great accomplishment. Let’s give a big cheer to Mike.
And so, we conclude Part 2 of this Blog series. I hope this article helps you to resolve similar issues at your end. Am eager to hear your feedback and inputs.
Our Next article will focus on improving performance while update ownership of Accounts.
- Salesforce Field indexing (https://help.salesforce.com/articleView?id=000325247&type=1&mode=1)
- Salesforce Indexing performance Gotchas (https://developer.salesforce.com/blogs/engineering/2013/03/force-com-formula-fields-indexes-and-performance-gotchas.html )
- Defer Sharing calculations(https://help.salesforce.com/articleView?id=security_sharing_rule_recalculation.htm&type=5)