Monday, December 30, 2013

Application Performance Analysis as a dedicated area within the IT organization




For more than five years I have worked for a large transnational company as a global IT infrastructure support specialist. The idea of separating Application Performance Analysis (APA) into a dedicated internal service appeared about two years ago, and now it is time to summarize the first results of my work. The main goal of this article is to receive constructive feedback from readers of this resource and to use it for further development of the service.

Application Performance Management

Application Performance Management [1] (APM), a broader discipline than APA, is becoming quite popular. Many large companies recognize the importance of this area or are close to recognizing it, but putting it into practice and evaluating the effort it requires is not that simple. Even defining what “optimal application performance” means is a complicated task with ambiguous conditions and derivation. Moreover, there is currently no reliable way to estimate the financial return on investments in Application Performance Management [2], so we can only use common sense when estimating the value of APM in each particular case.
Let’s take a typical example. Imagine we have an important application. The business places strict requirements on its availability, because every hour of outage has a noticeable financial impact. The business and the IT department sign off on a respective SLA that reflects all the requirements. To support this SLA, the IT department configures a monitoring tool which regularly polls the application front-end interface with a simple request and expects a typical response. If the application stops responding to test requests, the monitoring system generates an automatic alert to the respective support teams. Now, can we say that the IT department does its job properly and fully supports the business interests? Not necessarily.
One day, the users discover that their application still launches successfully and quite fast, but all the key operations run unacceptably slowly. The automatic monitoring stays silent, and we will be very lucky if at least one user calls technical support before the problem causes a severe business impact. My practice shows that users rarely escalate this kind of issue in time. There are plenty of reasons for this behavior, e.g. a simple unwillingness to call someone and explain ambiguous and complex symptoms. The application still works somehow and they can continue doing their job, only slower than usual. Such a problem may last for years, causing a hidden impact on the business, annoying people and significantly undermining the reputation of the IT department. Sometimes the problem comes to light only when the business impact of poor application performance becomes too obvious and attracts the attention of senior management.
Then quite an interesting story begins. IT senior management demands an immediate resolution. At a round-table discussion every team commits to validating its own area of responsibility. When the teams return with their verification results, we come to a fascinating conclusion: “Everything is all right!” The servers are fine: neither CPU nor memory is overloaded, and the system logs are free of errors. The network is good: there is no packet loss and no link overload. The client side should be OK as well, because other applications run fast on the same workstations. The problem is that when we assemble all these parts into one system it works terribly, but where do we search for the root cause? The application vendor doesn’t help either, referring to other clients who don’t complain and pointing to “some problems with infrastructure”. From here the story usually continues in one of three ways (sometimes in combination):
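The gap between “available” and “fast enough” can be illustrated with a minimal sketch. It assumes a hypothetical `probe` function whose `fetch` callable performs one key business operation (not just the login page) and returns the response body; the names, the threshold and the three result labels are illustrative assumptions and do not describe any particular monitoring product.

```python
import time
from typing import Callable


def probe(fetch: Callable[[], str], expected: str, slow_after_s: float,
          clock: Callable[[], float] = time.monotonic) -> str:
    """Run one synthetic check and classify the result as DOWN, SLOW or OK."""
    start = clock()
    try:
        body = fetch()
    except Exception:
        return "DOWN"   # no response at all: the classic availability alert
    elapsed = clock() - start
    if expected not in body:
        return "DOWN"   # responded, but not with the typical response
    if elapsed > slow_after_s:
        return "SLOW"   # available, yet too slow: invisible to an up/down check
    return "OK"
```

A pure up/down check corresponds to ignoring the `SLOW` branch, which is exactly the situation in which the monitoring stays silent.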

  • After a long and unsuccessful study the problem is forgotten (the business resigns itself to the slow performance of its application), but during the remediation attempts we have spent a lot of effort and money on chaotic upgrades of various system elements. That didn’t help, but we tried!
  • The business cannot accept the poor application performance and, seeing the unsuccessful attempts of the IT department to fix the issue, hires a third-party consultant who finds the root cause somewhere at the intersection of different technologies. However, this costs a lot of money and seriously undermines the reputation of the IT department.
  • Someone from the IT department identifies the root cause, but only by working far beyond his or her area of responsibility.


          The IT department can save face only in the last scenario, but there is no guarantee that it will happen. Moreover, it is not at all clear how to motivate people to work beyond their areas of responsibility, or how to maintain transparency and control when they do.

Main functions of Application Performance Analysis

      The IT department management of my company supported my idea of establishing Application Performance Analysis as a separate area. For the company this provided a standard and transparent way to address the most ambiguous issues related to application performance. For me it was a good opportunity to grow the area officially and do what I do best: cross-technological analysis of various IT systems. Now, let’s figure out what exactly APA does in our understanding.
            We have assigned the following set of functions to APA:
  • APA helps development teams with the initial benchmarking of new applications
  • APA provides infrastructure support and development teams with cross-technological analysis of various performance and functional issues
  • APA provides infrastructure and application support teams with continuous transaction performance monitoring and alerting for applications based in the Global Data Center


Picture 1

One of the key instruments of APA is its Application Architecture Knowledge Base. In it we record the functional purpose and technical architecture of every application that has ever come into the focus of APA. It also stores all related cases investigated by APA. This tool helps us avoid doing the same job twice and links APA to a more general area, Enterprise Architecture [3].
To avoid excessive expectations, let’s define a clear boundary for the APA area of responsibility:
  • APA helps development teams with initial performance benchmarking, but it is not responsible for a comprehensive assessment of new applications. Questions related to functionality, backup, redundancy, etc. are outside the APA research focus.
  • In complex cases APA helps support and development teams identify the root cause, but it does not manage the investigation process and is not responsible for successful remediation of the issue.
  • APA provides a solution for continuous application performance monitoring and alerting and customizes its configuration for specific customer needs, but it is not responsible for successfully fixing the issues identified with this solution.
The nature of APA is reflected in its logo (see picture 2). The lighthouse only helps to identify a proper course; the actual choice of route is always the captain’s decision.

Picture 2
The APA niche among other support services can be defined as follows:
  • Investigation of cases that have reached a deadlock
  • Cross-technological expertise
  • Prevention of performance incidents
          In ITIL, Performance Management is part of Capacity Management [4]. It may therefore be said that APA takes Performance Management beyond its traditional ITIL scope.
         APA plays an important part in the Application Performance Management process, but it does not replace it. It only centralizes and unifies the most complicated parts of that process, monitoring and analysis, while the actual performance management functions stay with the respective support teams, where they are most efficient. It is easier for those teams to establish personal contact with the vendor of their product, and they are more interested in the success of their services. APA is just an instrument addressing the most complicated part of APM; everything else can be resolved with standard management procedures.

APA methods

An APA specialist analyzes network traffic to identify the causes of performance issues. The network is a universal medium which connects the various application components, and in the vast majority of cases it provides enough information to identify the faulty element. However, this method has its limitations. For instance, we can identify the particular server which introduces the most delay in the processing of an end-user operation. We can even identify the most delayed call (e.g. a GET or POST HTTP request and the related longest SQL call to the DB). But we cannot say what is actually going on inside the server while the faulty operation executes. Moreover, if for some reason the application and its DB environment are hosted on the same hardware server, there is no way to separate their contributions to the total operation processing time. Fortunately, in the vast majority of cases the APA approach gives good results. From my practice I can say that about 99% of APA escalations were resolved successfully, and the information gained with APA methods played a key role in the resolution process.
There are plenty of software and hardware solutions on the market that support the Application Performance Management area. We use two main instruments: one for detailed off-line study of captured transactions, and another for agentless, application-aware network performance monitoring with intelligent decoding of various high-level protocols.
Now let’s consider the methods we use in each particular APA direction.

Operational support methods

            In operational support, a clear formulation of each incident is a key success factor. An end user can describe a performance issue only subjectively (and often quite emotionally). At the first stage it is very important to separate the emotional component (too strong coffee, a bad mood or a dislike for a particular application) from the actual problem description. For this purpose I have created a special list of questions to be answered when an incident is created. It is important to highlight that APA never interacts with end users directly. All necessary information should be provided by the generalist specialists of Tier 1 and Tier 2. Sometimes important information can also be collected with continuous performance monitoring tools.
            APA also needs a good description of the impacted application’s architecture, which should be provided by the respective application support team. If for some reason the application support team cannot provide this information, APA uses its own tools to identify the application architecture. Usually I ask for special software agents to be installed on the client and server parts of the application, and I use them to capture the network traffic synchronously during the impacted operation. After we agree on the test procedure, I either reproduce the issue myself (following the instructions) or ask someone to reproduce it at my command. At the same moment I activate capturing on my agents. The resulting material is a set of trace files recorded simultaneously on the various application components during execution of the problematic transaction. They contain the full history of network operations on the selected hosts. The key advantage of this approach is that the tool we use can merge these files into one trace aligned in time (i.e. 0 ms of the trace captured on the client is the same 0 ms in the traces captured on the servers). This allows us to track the delay at each stage of the operation: how long it took to deliver the client’s request to the application server, how long the application server processed this request before sending the necessary SQL request to the DB, how long the DB processed that request, and so on.
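The merging step can be illustrated with a minimal sketch. It is not the commercial tool we use; the `Event` record, the host names, the timestamps and the per-host clock offsets are all hypothetical, and the sketch simply assumes that the offsets are already known and that each trace is in capture order.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float    # capture time in seconds, on the capturing host's own clock
    what: str   # short description of the captured packet


def merge_traces(traces, clock_offset):
    """Shift every host's trace onto a common timeline, then merge by time.

    traces:       {host: [Event, ...]}, each list already in capture order
    clock_offset: {host: seconds to subtract so that all traces share t = 0}
    """
    shifted = [
        [Event(e.t - clock_offset[host], f"{host}: {e.what}") for e in events]
        for host, events in traces.items()
    ]
    return list(heapq.merge(*shifted, key=lambda e: e.t))


def stage_delays(merged):
    """Elapsed time between each pair of consecutive events in the merged trace."""
    return [(a.what, b.what, b.t - a.t) for a, b in zip(merged, merged[1:])]


# Hypothetical capture of one transaction on three hosts with unsynchronized clocks.
traces = {
    "client": [Event(100.000, "request sent"), Event(100.950, "response received")],
    "app": [Event(50.020, "request received"), Event(50.120, "SQL sent"),
            Event(50.820, "SQL result received"), Event(50.930, "response sent")],
    "db": [Event(0.125, "SQL received"), Event(0.815, "SQL result sent")],
}
offsets = {"client": 100.0, "app": 50.0, "db": 0.0}

merged = merge_traces(traces, offsets)
for before, after, delay in stage_delays(merged):
    print(f"{delay * 1000:7.1f} ms  {before} -> {after}")
```

With these made-up numbers the largest gap falls between the DB receiving the SQL request and sending its result, which is the kind of per-stage attribution the aligned trace makes possible.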
            Let’s consider a particular example.
Here is a fragment of the architecture of an application (see picture 3).

Picture 3
After collecting and merging the traces, our problematic operation looks as shown in picture 4:

Picture 4
Then we can automatically determine the influence of each application component on the total transaction time (see picture 5).

 Picture 5
As we can see, our problem is related to the application server and the CUBE server, but we can dig a bit deeper. For instance, we can identify the most delayed calls (see picture 6).

Picture 6
Now we have good material for escalation to the vendor of this software.
My practice shows that without such a detailed investigation most software vendors cannot fix performance problems in their applications. It is also important to say that this approach can help with the investigation of complicated functional issues (not only performance-related ones) where the support teams have reached a deadlock.

Initial application performance benchmarking

Technically this service is based entirely on the operational support method. On top of that, for centralized applications we can mathematically forecast front-end interface performance at various locations.
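The forecasting model itself is not described here, so the following is only a sketch of one common first-order approximation, not necessarily the one our tool applies: the response time at a remote site is taken as the server processing time, plus one network round trip per application turn, plus the time to transfer the payload. The function name and all numbers are hypothetical.

```python
def forecast_response_s(server_s, app_turns, rtt_s, payload_bytes, bandwidth_bps):
    """First-order estimate of the response time seen at a given location.

    server_s       time spent inside the servers (measured in a benchmark trace)
    app_turns      number of request/response round trips in the transaction
    rtt_s          network round-trip time between the location and the data center
    payload_bytes  total bytes transferred during the transaction
    bandwidth_bps  available bandwidth in bits per second
    """
    return server_s + app_turns * rtt_s + payload_bytes * 8 / bandwidth_bps


# The same hypothetical transaction measured in the data center LAN
# and forecast for a remote office on a slow, high-latency link.
lan = forecast_response_s(1.2, 40, 0.001, 500_000, 100e6)
wan = forecast_response_s(1.2, 40, 0.150, 500_000, 2e6)
print(f"LAN: {lan:.2f} s, remote office: {wan:.2f} s")
```

Even this simple model shows why a “chatty” application with many turns can be acceptable near the data center and unusable at a distant location.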

Application performance monitoring

As mentioned earlier, during a performance incident investigation it is not easy to obtain well-defined information from end users, simply because of the subjective nature of the question. On the other hand, technical support cannot work with subjective metrics. This conflict can make us focus on nonexistent or immaterial problems while really important issues pass unnoticed. The main conclusion is that effective application performance management needs an independent automatic monitoring system. Relying completely on end-user input here makes no sense.
To resolve this problem we use an agentless, application-aware network performance monitoring tool. It uses SPAN ports (mirror ports) as its source of information, which means that it has no influence on the systems we are interested in.
This solution not only helps to identify problems promptly, but also gives us an educated hint about where to search for the root cause. Moreover, with this tool we can analyze long-term trends in application performance and validate the performance impact of changes within our infrastructure.
Let’s consider a couple of examples of how we use the system.
Picture 7 shows a general performance and load assessment of the front-end interface of one of our application servers.

Picture 7
Here, Operations (left scale) is the total number of HTTP operations for a given time period, divided into slow operations and fast operations. An operation is considered slow if it takes longer than its predefined threshold. Average operation time (right scale) is the operation execution time averaged over the given time resolution.
The performance and load of the respective DB server for the same time period are shown in picture 8.
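The metrics on such a chart can be reproduced with a minimal sketch. It assumes hypothetical input records of the form (timestamp in seconds, operation name, duration in seconds) and a hypothetical per-operation threshold table; the real monitoring tool derives these values from decoded network traffic.

```python
from collections import defaultdict


def summarize(operations, thresholds_s, bucket_s=300):
    """Count fast and slow operations and average their duration per time bucket.

    operations:   iterable of (timestamp_s, name, duration_s)
    thresholds_s: {name: duration above which the operation counts as slow}
    bucket_s:     time resolution of the chart, in seconds
    """
    buckets = defaultdict(lambda: {"fast": 0, "slow": 0, "total_s": 0.0})
    for ts, name, duration in operations:
        b = buckets[int(ts // bucket_s) * bucket_s]
        b["slow" if duration > thresholds_s[name] else "fast"] += 1
        b["total_s"] += duration
    return {
        start: {"fast": b["fast"], "slow": b["slow"],
                "avg_s": b["total_s"] / (b["fast"] + b["slow"])}
        for start, b in sorted(buckets.items())
    }


# Hypothetical operations and thresholds.
ops = [(0, "GET /login", 0.2), (10, "POST /report", 4.0), (310, "GET /login", 0.9)]
limits = {"GET /login": 0.5, "POST /report": 3.0}
summary = summarize(ops, limits)
print(summary)
```

Keeping a separate threshold per operation matters because a report that normally takes three seconds and a login that normally takes half a second cannot share one definition of “slow”.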

Picture 8
We could also find out which particular SQL calls caused the spike in “Average operation time” between 8:30 and 9:00, but there is no strong need for that, as we can see that it did not impact front-end interface performance.

In-house vs. outsource

There are many companies on the market that provide application performance monitoring and analysis as a service. As far as I understand, they do this job with more or less the same tools that I use, though there can be add-ons and variations. If you are thinking of rolling out this service in your particular business, you will definitely face the question of whether to develop the area in-house or to use the help of outsourcing companies. When I was going through this stage I identified several important points which helped us make the final decision. Hopefully they will be interesting and useful for you as well.
One of the first factors to consider is price. With outsourcing, the installation cost can be noticeably lower than an in-house service would demand. Moreover, you avoid any increase in your fixed assets (as you just buy a service). For an in-house roll-out we should be ready for expenses: servers, licenses, personnel (including the effort of hiring and training them), annual maintenance, and resources for initial installation and configuration. The time needed before all this effort and spending starts to pay off is also going to be much longer if we decide to do the entire job ourselves. All these arguments may be quite valuable in the short term, but over time the balance can change. First of all, an outsourced service vendor is a commercial organization. It has to cover its expenses and does that with the revenue it earns from its customers. It is quite possible that in the long term, under a more or less high load, the service will cost much more outsourced than in-house. I spent a lot of time trying to compare all possible options. I had to make a number of assumptions along the way, relying on the existing experience of my company: how many investigations we need to perform during a year, how long each usually takes, and what it is going to cost in the end.
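The shape of such a comparison can be shown with a small sketch. Every figure below is a made-up placeholder in arbitrary cost units, not data from my company, and the model is deliberately simplistic: a one-time cost plus a constant yearly cost for each option.

```python
def cumulative_cost(upfront, yearly, years):
    """Total spend after a number of years: one-time cost plus recurring cost."""
    return upfront + yearly * years


def break_even_year(inhouse_upfront, inhouse_yearly, out_upfront, out_yearly, horizon=10):
    """First year in which in-house is no more expensive than outsourcing, or None."""
    for year in range(1, horizon + 1):
        inhouse = cumulative_cost(inhouse_upfront, inhouse_yearly, year)
        outsourced = cumulative_cost(out_upfront, out_yearly, year)
        if inhouse <= outsourced:
            return year
    return None


# Placeholder assumptions: 60 investigations a year at 5 units each when outsourced.
investigations_per_year = 60
price_per_investigation = 5
year = break_even_year(
    inhouse_upfront=300, inhouse_yearly=150,
    out_upfront=50, out_yearly=investigations_per_year * price_per_investigation,
)
print("in-house pays off in year:", year)
```

The result is very sensitive to the assumed number of investigations per year, which is why that assumption deserves the most attention.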
A no less important question is what we actually get in return for our investment: the quality, volume and promptness of critical incident investigations.
A third-party vendor always works within the signed contract. Unlike internal associates, it has no direct interest in the success of our business. This leads to inflexibility in borderline cases. As a rule, everything that is not mentioned in the contract is either not possible or pricey. For instance, we may have a highly critical incident that is not related to performance, but we are sure it can be effectively investigated with the same tools and approaches we use in performance investigations. An internal APA team will definitely help, but what would an external vendor say? Formally, this case is far beyond the contract, and we cannot be sure that we can persuade the vendor’s manager to help us, simply because that manager may not be technical enough to understand all our arguments.
One possible result of an APA investigation is a recommendation to change the existing infrastructure significantly. This can happen when we want not only to resolve an existing incident but also to prevent the problem from recurring. A third-party vendor is not usually interested in proactive problem resolution, because incidents are its income. Besides, such recommendations carry a high level of responsibility, because they may affect the general strategy of infrastructure development. Leaving such questions to the decision of an external vendor is not always what we want to do.
Another no less complicated question is speed in critical incident investigation. Performance investigation is a complicated area. We can never predict how long each case will take; there are too many unknown factors. In any case, this is not something we can write down in a contract. But what do we do when we need a special effort on the root cause analysis of a particular issue? An internal team is interested in the success of our business, while an external vendor is not.
Effective collaboration with an external vendor assumes a certain formalization of requests and results. Now the question is: is every internal IT subdivision ready to formulate its demands clearly, and able to understand what exactly the vendor means in its replies? Quite often I receive questions in which people ask for something other than what they actually need. It turns out that even with outsourcing we need someone within our company to work as a translator. This person should know the right people within the company and be ready to talk to the vendor in its own language. The role assumes good technical expertise, the same expertise that is required to perform the entire job in-house.
Consideration of all the points listed above led us to the decision to keep APA in-house. However, for each particular business case the right answer can be different. Any argument in this area is debatable, simply because of the nature of the question. I am just sharing our experience.

Conclusion

Everything written above describes one version of a practical implementation of APA. In our case it gives good results and our management appreciates our work. Over the last year APA was involved in the resolution of many critical issues, and in the majority of cases it played a key role in the investigation. Moreover, in many cases I can hardly imagine how the findings could have been obtained without APA methods. I hope that you find this material useful. I will appreciate your constructive criticism, ideas and interesting questions.
Truly yours, Daniil Kochetov (daniil.kochetov@effem.com)

References

[2] “Retail IT Service Operation: Calculating the Impacts of Poor Application Performance Across a Business Ecosystem”. Enterprise Management Associates. 2011