Database Pool Size and Runtime Instances - 23.1

Spectrum Dataflow Designer Guide

Version
23.1
Language
English
Product name
Spectrum Technology Platform
Title
Spectrum Dataflow Designer Guide
First publish date
2007
Last updated
2024-05-09
Published on
2024-05-09T23:01:03.226155

In most Spectrum Technology Platform environments there are multiple flows running at the same time, whether they are batch jobs or services responding to web service or API requests. To optimize concurrent processing, you can use the database pool size setting, which limits the number of concurrent requests a Spectrum database handles, and runtime instances, which controls the number of instances of a flow stage that run concurrently. These two settings should be tuned together to achieve optimal performance.

Database Pool Size

Spectrum databases contain reference data used by certain stages, such as postal data used to validate addresses, or geocoding data used to geocode addresses. These databases can be configured to accept multiple concurrent requests from the stages or services that use them, thereby improving the performance of the flows or service requests. The database pool size sets the maximum number of concurrent requests that a Spectrum database will process. By default, Spectrum databases have a pool size of 4, meaning the database can process four requests simultaneously.

The optimal pool size varies by module. You will generally see the best results by setting the pool size between one-half to twice the number of CPUs on the server, with the optimal pool size for most modules being the same as the number of CPUs. For example, if your server has four CPUs you may want to experiment with a pool size between 2 (one-half the number of CPUs) and 8 (twice the number of CPUs) with the optimal size possibly being 4 (the number of CPUs).

When modifying the pool size you must also consider the number of runtime instances specified in the flow for the stages accessing the database. Consider for example a flow that has a Global Geocode stage that is configured to use one runtime instance. If you set the pool size for the Geocoding database to four, you will not see a performance improvement because there would be only one runtime instance and therefore there would only be one request at a time to the database. However, if you were to increase the number of runtime instances of Global Geocode to four, you might then see an improvement in performance since there would be four instances accessing the database simultaneously, therefore using the full pool.

Runtime Instances

Each stage in a flow operates asynchronously in its own thread and is independent of any other stage. This provides for parallel processing of stages in a flow, allowing you to utilize more than one runtime instance for a stage. This is useful in flows where some stages process data faster than others. This can lead to an unbalanced distribution of work among the threads. For example, consider a flow consisting of these stages:

Depending on the configuration of the stages, it may be that the Validate Address stage processes records faster than the Global Geocode stage. If this is the case, at some point during the execution of the flow all the records will have been processed by Validate Address, but Global Geocode will still have records to process. In order to improve performance of this flow, it is necessary to improve the performance of the slowest stage - in this case Global Geocode. One way to do that is to specify multiple runtime instances of the stage. Setting the number of runtime instances to two, for example, means that there will be two instances of that stage, each running in its own thread, available to process records. Keep in mind that while specifying multiple runtime instances can help improve performance, setting this value too high can strain your system resources, resulting in decreased performance.

Tuning Procedure

Finding the right settings for database pool size and runtime instances is a matter of experimenting with different settings to find the ones maximize available server resources without overloading resources and causing reduced performance.

Note: You should optimize the flow pool size before tuning the database pool size. For information about optimizing the flow pool size, see Dataflow Pool Size.
  1. Begin by finding sample data to use as you test different settings. The sample dataset should be large enough that execution time is measurable and can be validated for consistency. The sample data should also be representative of the actual data you want to process. For example, if you are doing performance testing for geocoding, be sure that your test data has an equal number of records for all the countries you intend to geocode.
  2. If you are testing a service or flow that requires the use of a database resource, such as postal databases or geocoding databases, make sure that you have the latest version of the database installed.
  3. With sample data ready and the latest database resources installed, create a simple flow that reads data from a file, processes it with the stage you want to optimize, and writes to a file. For example, if you want to test performance settings for Validate Address, create a flow consisting of Read from File, Validate Address, and Write to File.
  4. Set the database resource pool size to 1:
    1. Open Spectrum Management Console.
    2. Go to Resources > Spectrum Databases.
    3. Select the database resource you want to optimize and click the Modify button .
    4. In the Pool size field, specify 1.
    5. Click OK.
  5. Set the stage's runtime instances to 1:
    1. Open the flow in Spectrum Enterprise Designer.
    2. Double-click the stage that you want to set to use multiple runtime instances.
    3. Click Runtime.
      Note: Not all stages are capable of using multiple runtime instances. If there is no Runtime button at the bottom of the stage's window, the stage is not capable of using multiple runtime instances.
    4. Select Local and specify 1.
    5. Click OK to close the Runtime Performance window, then click OK to close the stage.
  6. Calculate baseline performance by running the flow several times and recording the average values for:
    • Elapsed time
    • CPU utilization
    • Memory utilization
    Tip: You can use the Spectrum JMX console to monitor performance. For more information, see Monitoring Performance with the Spectrum JMX Console.
  7. Run multiple instances of the job concurrently, if this is a use case that must be supported. Record elapsed time, CPU utilization, and memory utilization for each scenario.
    Tip: You can use a file monitor to run multiple instances of a job at once. For more information, see Triggering a Flow with a Control File.
  8. Increment the database resource pool size and the stage runtime instances setting.
  9. Restart the server.
  10. Run the flow again, recording the elapsed time, CPU utilization, and memory utilization.
  11. Continue to increment the database resource pool size and the stage runtime instances until you begin to see diminishing performance.
  12. If you are testing geocoding performance, repeat this procedure using single country and multicountry input.