IBM DataPower service latency - a fresh look (for the API generation)
IBM DataPower has been a core component for many organisations, acting as a trusted, secure gateway and integration hub. A significant amount of financial data traffic is handled by these devices. DataPower is now also a core product within IBM API Connect, where its role as a gateway is further harnessed.
So we set out to provide a monitoring solution for DataPower to add to our capabilities for other products within the IBM Integration Suite (WMB/IIB/ACE & MQ) in Square Bubble (https://squarebubble.io). One of the areas we spent some time focussing on was the service latency data that DataPower provides. This is a treasure trove of data that can help to determine performance of individual services using real, live data. The message, which includes the latency data, is available over syslog but it is rather difficult to interpret. Here is an example:
There is a technote which defines the fields as follows:
- Request header read
- Request header sent
- Front side transform begun
- Front side transform complete
- Entire request transmitted
- Front side style-sheet ready
- Front side parsing complete
- Response header received
- Response headers sent
- Back side transform begun
- Back side transform complete
- Response transmitted
- Back side style-sheet ready
- Back side parsing complete
- Back side connection attempted
- Back side connection completed
This is not a particularly easy list to make use of, and there are further complications: the numbers are not listed in chronological order, and each one represents a running total. To determine the true time taken by a given stage, you therefore need to know the true order of the fields and subtract the value that precedes it in that order. This is not a simple calculation.
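To illustrate the arithmetic, here is a minimal Python sketch of turning running totals into per-step durations. It is not our implementation: the field keys, the four-field chronological order and the sample values are all hypothetical, chosen only to show the subtraction. The real order of all sixteen fields has to come from the technote.

```python
from typing import Dict, List


def step_durations(running_totals_ms: Dict[str, int],
                   chronological_order: List[str]) -> Dict[str, int]:
    """Convert running totals into the time spent reaching each step.

    running_totals_ms   -- field name -> cumulative milliseconds, as logged
    chronological_order -- field names in the order the steps really happen
    """
    durations = {}
    previous = 0
    for name in chronological_order:
        current = running_totals_ms[name]
        # Time for this step = its running total minus the previous one.
        durations[name] = current - previous
        previous = current
    return durations


# Hypothetical values, listed in the (non-chronological) logged order.
logged = {
    "request_header_read": 2,
    "request_header_sent": 14,
    "front_side_transform_begun": 5,
    "front_side_transform_complete": 9,
}

# ASSUMPTION: an illustrative chronological order for these four fields only.
assumed_order = [
    "request_header_read",
    "front_side_transform_begun",
    "front_side_transform_complete",
    "request_header_sent",
]

durations = step_durations(logged, assumed_order)
print(durations)
```

With these made-up numbers, the 14 ms logged against "request header sent" becomes 5 ms once the 9 ms already accumulated by the preceding step is subtracted.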
With help from our colleague and DataPower expert Kim Seeley at Syntegrity Solutions, we mapped out a more user-friendly set of metrics. The following sequence diagram represents the stages we believe are of most interest:
Let's take a closer look at each stage:
- Client Request is the time (in ms) taken for the request to be transmitted between the client and the Gateway
- Front Side Processing includes the time taken to perform stylesheet lookups and gateway scripts applicable to the Client request
- Back End Connect is the time taken to connect to the back end. Typically the gateway will reuse a connection where possible
- Back End Request is the time taken to transmit the request between Gateway and Back End
- Back End Processing is the time taken by the Back End to process the request
- Back End Response is the time taken to transmit the response between Back End and Gateway
- Back Side Processing is the time taken to perform stylesheet lookups and gateway scripts applicable to the Back End response, and finally
- Client Response is the time taken to transmit the response from Gateway to Client
In addition to these stages we provide the overall time taken (in ms). OK, so where is the value in this breakdown? The overall figure is undoubtedly sufficient for high-level SLA/OLA type monitoring. If the overall time is less than what is expected or required, all is good, right? That is fine, but what if the time taken is longer than expected or required? What if the challenge is to streamline the process so that it takes the minimum amount of time possible? In these cases we need to look under the covers to see where the time is being consumed.
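As a rough sketch of what "looking under the covers" can mean in practice, the Python below ranks the eight stages by their share of the total. The sample values are hypothetical, and the sketch assumes the stage times add up to the overall time, which may not hold exactly for real data.

```python
from typing import Dict, List, Tuple


def rank_stages(breakdown_ms: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (stage, milliseconds, percent of total), slowest stage first."""
    total = sum(breakdown_ms.values())
    if total == 0:
        return []
    ranked = sorted(breakdown_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, round(100 * ms / total, 1)) for name, ms in ranked]


# Hypothetical breakdown for a single request, in milliseconds.
sample = {
    "Client Request": 1,
    "Front Side Processing": 2,
    "Back End Connect": 0,
    "Back End Request": 1,
    "Back End Processing": 40,
    "Back End Response": 2,
    "Back Side Processing": 3,
    "Client Response": 1,
}

ranking = rank_stages(sample)
for name, ms, share in ranking:
    print(f"{name:<22} {ms:>4} ms  {share:>5}%")
```

In this made-up request, Back End Processing accounts for 40 of the 50 ms, so that is where any optimisation effort would go first.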
Some of these stages should be negligible. In a modern data centre with 1 Gb/s or 10 Gb/s network speeds (or faster), transmission of data, particularly with small payloads, should be almost instantaneous. Larger payloads (in the megabytes) will take a number of milliseconds to deliver, but this should be consistent over time, with some variation due to network latency. In our experience, Front Side and Back Side Processing will also be mostly negligible, although significant time can be spent here if complex gateway scripts are deployed, which is rare. Tracking these values for the same service over time should provide a good baseline of what is normal.
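One simple way to turn such a history into a baseline is sketched below in Python: take the median of past values for a stage and flag a new value that exceeds it by some multiple. The factor of 3, the 1 ms floor and the sample history are arbitrary assumptions for illustration, not recommended thresholds.

```python
from statistics import median
from typing import Sequence


def exceeds_baseline(value_ms: float,
                     history_ms: Sequence[float],
                     factor: float = 3.0,
                     floor_ms: float = 1.0) -> bool:
    """Return True if value_ms is well above the typical historical value.

    The baseline is the median of history_ms. floor_ms stops a baseline of
    0 ms from flagging every non-zero value as abnormal.
    """
    typical = max(median(history_ms), floor_ms)
    return value_ms > factor * typical


# Hypothetical Back End Response times (ms) for one service.
history = [2, 3, 2, 4, 3]

print(exceeds_baseline(5, history))    # within 3x the median of 3 ms
print(exceeds_baseline(40, history))   # far above the baseline
```

The median is used rather than the mean so that an occasional slow request does not drag the baseline upwards.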
The greater sources of interest however are:
- Back End Processing. DataPower provides a true value for the time taken by the Back End, as detected by the Gateway. Comparing this with what the Back End itself reports has proved to be an interesting exercise. It is particularly valuable when applied to a third-party service where SLAs are in place.
- Back End Connect. We should expect spikes in connection times at the start of a running deployment; after that, the Gateway should reuse connections for future requests. DataPower is particularly slick in this regard. This is, however, an area where we have found some interesting results, with some services needing to reconnect due to network timeouts. Connection processes may not take significant time, but they do represent work that could be avoided. They can also take significant time, particularly if complex authentication and authorisation is required.
- Back End Request and Response can also be useful for monitoring network performance to the Back End. As stated previously, if the Back End is deployed in the same data centre, the timings should be negligible (taking payload size into consideration). However, if it is accessed over a WAN or over the internet, this data provides a valuable record of actual network performance as seen by the Gateway.
When developing a solution, we believe it is imperative to have a good understanding of performance before it is deployed in a production environment. Assuming one has access to a representative non-production environment, this breakdown should provide ample data to: 1. Identify areas that could be optimised, which can be addressed incrementally, and 2. Provide baseline data of the performance expected in production.
Having this data available in production is of great value. It helps ensure that your critical business services are operating within acceptable limits and continue to deliver benefits to your customers, which is what they should be all about, right?
IBM API Connect has an analysis module that provides service latency stats, but it doesn’t drill down to the level we have described. That’s not the end of the road, however, as we can provide this breakdown for you.