Application Performance Analysis as a dedicated area within an IT organization
For more than five years I have been working for a large transnational company as a global IT infrastructure support specialist. The idea of separating Application Performance Analysis (APA) into a dedicated internal service appeared about two years ago, and now it is time to summarize the first results of my work. The main goal of this article is to receive constructive feedback from the readers of this resource and to use it for the further development of the service.
Application Performance Management
Nowadays the topic of Application Performance Management [1] (APM), an area more general than APA, is becoming quite popular. Many large companies recognize, or are close to recognizing, the importance of this area, but its practical realization and the evaluation of the effort spent on it are not that simple. Even formulating what “optimal application performance” means is a complicated task with ambiguous conditions and conclusions. Moreover, there is currently no reliable way to estimate the financial return on investments in Application Performance Management [2], so we can only use our common sense when estimating the value of APM in each particular case.
Let’s take a typical example. Imagine we have an important application. The business imposes strict availability requirements on it, because every hour of outage causes a noticeable financial impact. The business and the IT department sign off the respective SLA where all requirements are reflected. To support this SLA, the IT department configures a monitoring tool which regularly polls the application front-end interface with a simple request and expects a typical response. If the application stops responding to the test requests, the monitoring system generates an automatic alert to the respective support teams. Now, can we say that the IT department does its job properly and fully supports the business interests? Not sure.
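To make this setup concrete, here is a minimal sketch of such a front-end availability probe; the URL, the expected response fragment and the polling interval are hypothetical placeholders, not our actual configuration.

```python
# Minimal sketch of a simple front-end availability probe.
# APP_URL, EXPECTED_MARKER and POLL_INTERVAL_S are hypothetical placeholders.
import time
import requests

APP_URL = "https://app.example.com/login"   # hypothetical front-end endpoint
EXPECTED_MARKER = "Sign in"                 # fragment expected in a healthy response
POLL_INTERVAL_S = 60

def probe_once() -> bool:
    """Send one synthetic request and report whether the front end looks healthy."""
    try:
        resp = requests.get(APP_URL, timeout=10)
        return resp.status_code == 200 and EXPECTED_MARKER in resp.text
    except requests.RequestException:
        return False

if __name__ == "__main__":
    while True:
        if not probe_once():
            print("ALERT: front-end probe failed")  # in reality this raises an alert/ticket
        time.sleep(POLL_INTERVAL_S)
```

As the story below shows, such a probe only proves that the front end answers a trivial request; it says nothing about how fast the key operations actually run.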
One day, application users discover that the application still launches successfully and quite fast, but all the key operations run unacceptably slowly. The automatic monitoring keeps silent, and we will be very lucky if at least one user calls technical support before the problem causes severe business impact. My practice shows that users rarely escalate this kind of issue on time. There are plenty of reasons for such end-user behavior, e.g. a simple unwillingness to call someone and explain the ambiguous and complex symptoms of the issue. The application still works somehow, and they can continue doing their job, just more slowly than usual. Such a problem may last for years, causing hidden impact to the business, annoying people and significantly undermining the reputation of the IT department. Sometimes the problem comes to light only when the business impact of poor application performance becomes too obvious and catches the attention of senior management.
Then a quite interesting story begins. The IT senior management demands an immediate resolution of the problem. At a round-table discussion every team commits to validating its own area of responsibility. When they return with the verification results, we come to a fascinating conclusion: “Everything is all right!” The servers are fine: neither CPU nor memory is overloaded, and the system logs are free of errors. The network is good: there is no packet loss and no link overload. The client side should be OK as well, because other applications work fast on the same workstations. The problem is that when we assemble all these parts into one system, it works terribly, and where do we search for the root cause? The application vendor does not help either, referring to other clients who do not complain and pointing to “some problems with the infrastructure”. Then we usually see one of three continuations (sometimes in combination):
- After a long and unsuccessful investigation the problem is forgotten (the business reconciles itself with the slow performance of its application), but in the course of the remediation a lot of effort and money has been spent on chaotic upgrades of various system elements. They did not help, but at least we tried!
- The business cannot accept poor application performance and, looking at the unsuccessful attempts of the IT department to fix the issue, brings in a third-party consultant who finds the root cause somewhere at the crossing point of different technologies. However, this costs a lot of money and seriously undermines the reputation of the IT department.
- Someone from the IT department identifies the root cause, but has to work far beyond their area of responsibility.
The IT department saves face only in the last scenario, but there is no guarantee that it happens. Moreover, it is not at all clear how to motivate people to work beyond their areas of responsibility and how to keep the process transparent and controlled in such cases.
Main functions of Application Performance Analysis
The IT department management of my company supported my idea to make Application Performance Analysis a separate area. For the company this provided the ability to address the most ambiguous issues related to application performance in a standard and transparent way. For me it was a good opportunity to officially grow the area and do what I can do best: cross-technological analysis of various IT systems. Now let’s figure out what exactly APA does in our understanding.
We have assigned the following set of functions to APA:
- APA helps development teams with the initial benchmarking of new applications
- APA provides infrastructure support and development teams with cross-technological analysis of various performance and functional issues
- APA provides infrastructure and application support teams with continuous transaction performance monitoring and alerting for applications hosted in the Global Data Center
Picture 1
One of the key instruments of APA is its Application Architecture Knowledge Base. In this base we record the functional meaning and the technical architecture of every application which has ever come into the focus of APA. It also stores all related cases investigated by APA. This tool helps us avoid doing the same job twice and ties APA to a more general area, Enterprise Architecture [3].
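Purely for illustration, one record in such a knowledge base could look roughly like the sketch below; the field names and the sample application are invented for this example and do not describe our actual implementation.

```python
# Illustrative sketch of a single Application Architecture Knowledge Base record.
# All field names and sample values are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ApplicationRecord:
    name: str                     # application name
    business_function: str        # what the application does for the business
    components: List[str]         # servers, databases, middleware, front ends
    protocols: List[str]          # protocols observed between the components
    related_cases: List[str] = field(default_factory=list)  # past APA investigations

record = ApplicationRecord(
    name="SalesReporting",        # hypothetical application
    business_function="Consolidated sales reporting for regional offices",
    components=["web front end", "application server", "database server", "CUBE server"],
    protocols=["HTTP", "SQL", "MDX"],
)
record.related_cases.append("Slow report generation; root cause on the CUBE server")
```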
To avoid excessive expectations, let’s define a clear boundary of APA’s area of responsibility:
- APA helps development teams with initial performance benchmarking, but it is not responsible for a comprehensive assessment of new applications. Questions related to functionality, backup, redundancy, etc. are outside the APA research focus.
- In complex cases APA helps support and development teams with root cause identification, but it does not manage the investigation process and is not responsible for the successful remediation of the issue.
- APA provides a solution for continuous application performance monitoring and alerting, and performs customized configuration for specific customer needs, but it is not responsible for the successful fix of the issues identified with this solution.
The nature of APA is reflected in its logo (see picture 2). The lighthouse only helps to identify the proper course; the actual route selection is always the captain’s decision.
Picture 2
The APA niche among other support services can be defined as follows:
- Investigation of deadlocked cases
- Cross-technological expertise
- Prevention of performance incidents
In ITIL, Performance Management is part of Capacity Management [4], so it may be said that APA takes Performance Management beyond the traditional understanding of ITIL.
APA plays an important part within the Application Performance Management process, but it does not substitute for it. It is merely the centralization and unification of the most complicated parts of APM, monitoring and analysis, while the functions of actual performance management remain with the respective support teams, where they are most efficient: it is easier for them to establish personal contact with the vendor of their product, and they are more interested in the success of their services. APA is just an instrument addressing the most complicated part of APM; everything else can be resolved with standard management procedures.
APA methods
An APA specialist analyses network traffic to identify the causes of performance issues. The network is a universal medium which unites the various application components, and in the vast majority of cases it can provide enough information to identify the faulty element. However, this method has its limitations. For instance, we can identify the particular server which introduces the most delay into end-user operation processing. We can even identify the most delayed system call (e.g. a GET or POST HTTP request and the related longest SQL call to the DB). But we cannot say what is actually going on within the server during the execution of the faulty operation. Moreover, if for some reason the same hardware server hosts both the application and its DB environment, there is no way to separate their impact on the total operation processing time. Fortunately, in the vast majority of cases the APA approach gives good results. From my practice I can say that about 99% of APA escalations were resolved successfully, with the information gained through APA methods playing a key role in the resolution process.
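To illustrate the general idea with something concrete, here is a small sketch that estimates per-server response delay from a packet capture; the capture file name and the client address are hypothetical, and the production tools mentioned below do this in a far more sophisticated way.

```python
# Sketch: estimate per-server response delay from a packet capture.
# "transaction.pcap" and CLIENT_IP are hypothetical placeholders.
from collections import defaultdict
from scapy.all import rdpcap, IP, TCP, Raw

CLIENT_IP = "10.0.0.15"                  # hypothetical client workstation
packets = rdpcap("transaction.pcap")     # hypothetical trace of one slow operation

pending = {}                             # server IP -> time of last unanswered request
delays = defaultdict(list)               # server IP -> observed response delays

for pkt in packets:
    if IP not in pkt or TCP not in pkt or Raw not in pkt:
        continue                         # only data-carrying TCP segments are of interest
    src, dst = pkt[IP].src, pkt[IP].dst
    if src == CLIENT_IP:
        pending[dst] = float(pkt.time)   # client sent a request to this server
    elif dst == CLIENT_IP and src in pending:
        delays[src].append(float(pkt.time) - pending.pop(src))  # first response byte

for server, observed in delays.items():
    print(f"{server}: {len(observed)} requests, worst delay {max(observed):.3f}s")
```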
There are plenty of software and hardware solutions on the market supporting the Application Performance Management area. We use two main instruments: one for offline detailed study of captured transactions, and another for agentless application-aware network performance monitoring with intelligent decoding of various high-level protocols.
Now let’s consider the methods we use in each particular APA direction.
Operational support methods
In operational support, a clear formulation of each incident is a key success factor. An end user can only describe a performance issue subjectively (and often does so quite emotionally). At the first stage it is very important to separate the emotional component (coffee that was too strong, a bad mood, or a dislike for a particular application) from the actual problem description. For this purpose I have created a special list of questions to be answered during incident creation. It is important to highlight that APA never interacts with end users directly; all the necessary information should be provided by the universal specialists of Tiers 1 and 2. Sometimes important information can also be collected with the continuous performance monitoring tools.
APA also needs a good description of the impacted application’s architecture, which should be provided by the respective application support team. If, for some reason, the application support team cannot provide this information, APA uses its tools to identify the application architecture itself. Usually I ask for special software agents to be installed on the client and server parts of the application, which I use to synchronously capture the network traffic during the impacted operation. After we agree on the problem testing procedure, I either reproduce the issue myself (as per the instructions) or ask someone to reproduce it on my command. Simultaneously, I activate capturing on my agents. The resulting material is a set of trace files recorded simultaneously on the various application components during the execution of the problematic transaction. They contain the full history of network operations on the selected hosts. The key advantage of this approach is that the tool we use allows merging these files into one trace aligned in time (i.e. 0 ms of the trace captured from the client is the same 0 ms of the trace files captured from the servers). This allows tracking the delay at each stage of the operation: how long it took to deliver the client’s request to the application server, how long the application server was processing this request before sending the necessary SQL request to the DB, how long the DB was processing that request, and so on.
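Conceptually, the merging step can be sketched as follows; the file names are placeholders, and the sketch assumes the capture hosts’ clocks are already synchronized, which the real tool takes care of.

```python
# Sketch of merging per-host capture files into one time-aligned event list.
# File names are hypothetical; clock synchronization is assumed.
from scapy.all import rdpcap

TRACES = {
    "client":     "client.pcap",      # hypothetical capture from the workstation
    "app-server": "appserver.pcap",   # hypothetical capture from the application server
    "db-server":  "dbserver.pcap",    # hypothetical capture from the database server
}

events = []
for host, path in TRACES.items():
    for pkt in rdpcap(path):
        events.append((float(pkt.time), host, pkt.summary()))

events.sort(key=lambda e: e[0])       # one merged timeline across all capture points
t0 = events[0][0]                     # 0 ms of the merged trace
for ts, host, summary in events[:20]:
    print(f"{(ts - t0) * 1000:8.1f} ms  {host:10s}  {summary}")
```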
Let’s consider a particular example. Here is a fragment of the architecture of one application (see picture 3).
Picture 3
After collecting and merging the traces, the problematic operation looks as shown in picture 4:
Picture 4
Then we can automatically determine the influence of each application component on the total transaction time (see picture 5).
Picture 5
As we can see, our problem is related to the application server and the CUBE server, but we can dig a bit deeper. For instance, we can identify the most delayed calls (see picture 6).
Picture 6
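Behind pictures 5 and 6 there is nothing more exotic than grouping the merged calls by component and sorting them by duration; the sketch below shows the idea on invented numbers.

```python
# Sketch of the breakdown behind pictures 5 and 6: per-component share of the total
# transaction time and the most delayed individual calls. The sample data is invented.
from collections import defaultdict

# (component, call, duration in seconds) for one slow transaction -- hypothetical values
calls = [
    ("app-server",  "POST /report/run",          4.2),
    ("cube-server", "MDX SELECT ...",            3.1),
    ("db-server",   "SELECT ... FROM sales ...", 0.4),
    ("network",     "request/response transit",  0.3),
]

total = sum(duration for _, _, duration in calls)
per_component = defaultdict(float)
for component, _, duration in calls:
    per_component[component] += duration

for component, spent in sorted(per_component.items(), key=lambda item: -item[1]):
    print(f"{component:12s} {spent:5.1f}s  ({spent / total:6.1%} of the transaction)")

print("\nMost delayed calls:")
for component, call, duration in sorted(calls, key=lambda c: -c[2])[:3]:
    print(f"{duration:5.1f}s  {component:12s} {call}")
```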
Now we have good material for an escalation to the vendor of this software.
My practice shows that without such a detailed investigation most software vendors cannot fix performance problems in their applications. It is also important to say that this approach can help with the investigation of complicated functional issues (not only performance-related ones), where support teams have reached a deadlock in their investigation.
Initial application performance benchmarking
Technically this service is completely based on the operational support method. On top of that, for centralized applications we can mathematically forecast the front-end interface performance at various locations.
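One simplified model for such a forecast combines the measured server time, the number of application turns and the payload size with each location’s round-trip time and bandwidth; all figures in the sketch below are illustrative, and the model itself is a rough approximation rather than our exact method.

```python
# Rough sketch of forecasting front-end response time for remote locations,
# based on a transaction profile measured in the benchmark. All numbers are illustrative.
def forecast_response_time(server_time_s: float, app_turns: int, payload_bytes: int,
                           rtt_s: float, bandwidth_bps: float) -> float:
    """Server processing + one round trip per application turn + transfer time."""
    return server_time_s + app_turns * rtt_s + payload_bytes * 8 / bandwidth_bps

profile = {"server_time_s": 1.2, "app_turns": 40, "payload_bytes": 350_000}  # hypothetical

locations = {                        # hypothetical WAN characteristics per location
    "HQ (LAN)":        {"rtt_s": 0.001, "bandwidth_bps": 100e6},
    "Regional office": {"rtt_s": 0.060, "bandwidth_bps": 10e6},
    "Remote site":     {"rtt_s": 0.180, "bandwidth_bps": 2e6},
}

for name, link in locations.items():
    expected = forecast_response_time(**profile, **link)
    print(f"{name:16s} ~{expected:.1f}s expected operation time")
```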
Application performance monitoring
As was said earlier, during a performance incident investigation it is quite a difficult task to obtain well-defined information from end users, simply due to the subjective nature of the question. On the other hand, technical support cannot operate with subjective metrics. This conflict can make us focus on nonexistent or immaterial problems, while really important issues pass by unnoticed. The main conclusion here is that for effective application performance management we need an independent automatic monitoring system; relying entirely on end-user input in this matter is pointless.
To solve this problem we use an agentless application-aware network performance monitoring tool. It uses SPAN (port mirroring) ports as its source of information, which means that we do not influence the systems of our interest in any way.
This solution not only helps with prompt problem identification, but also gives us an educated hint about where to search for a root cause. Moreover, with this tool we can analyze long-term trends in application performance and validate the performance impact of changes within our infrastructure.
Let’s consider a couple of examples of how we use this system.
Picture 7 shows a general performance and load assessment of the front-end interface of one of our application servers.
Picture 7
Here, Operations (left scale) is the total number of HTTP operations for a given time period, divided into slow operations and fast operations; an operation is considered slow if it executes more slowly than its predefined threshold. Average operation time (right scale) is the average operation execution time within the given time resolution.
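In other words, the metrics behind picture 7 reduce to counting operations against their thresholds and averaging their execution times, as in this sketch with invented sample data.

```python
# Sketch of the metrics behind picture 7: slow/fast operation counts and the
# average execution time within one monitoring interval. All data is invented.
THRESHOLDS_S = {"GET /report": 1.0, "POST /run": 3.0}   # hypothetical per-operation thresholds

# (operation, execution time in seconds) observed within one monitoring interval
operations = [("GET /report", 0.8), ("POST /run", 3.4),
              ("GET /report", 1.1), ("POST /run", 2.6)]

slow = [op for op in operations if op[1] > THRESHOLDS_S[op[0]]]
fast = [op for op in operations if op[1] <= THRESHOLDS_S[op[0]]]
average = sum(duration for _, duration in operations) / len(operations)

print(f"operations: {len(operations)} total, {len(fast)} fast, {len(slow)} slow")
print(f"average operation time: {average:.2f}s")
```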
The performance and load of the respective DB server for the same time period are shown in picture 8.
Picture 8
We could also find out which particular SQL calls caused the spike in “Average operation time” between 8:30 and 9:00, but there is no strong need for that, as we can see that it did not impact the front-end interface performance.
In-house vs. outsource
On the market you can find a lot of companies which provide application performance monitoring and analysis as a service. As far as I can tell, they do this job with more or less the same tools that I use, though there can be add-ons and variations. If you are thinking of rolling out this service in your particular business, you will inevitably face the question of whether to develop the area in-house or to use the help of outsourcing companies. When I was going through this stage, I identified several important points which helped us make the final decision. Hopefully they will be interesting and useful for you as well.
One of the first factors to consider is the price. With outsourcing, the installation cost can be noticeably lower than an in-house service would demand. Moreover, you avoid any increase of your fixed assets (as you simply buy a service). For an in-house roll-out we should be ready for expenses: servers, licenses, personnel together with the effort of hiring and training them, annual maintenance, and resources for the initial installation and configuration. Also, the time before all our efforts and spending start bringing profit is going to be much longer if we decide to do the entire job ourselves. All these arguments may carry a lot of weight in the short term, but over time the balance can change. First of all, an outsourcing vendor is a commercial organization: it has to cover its expenses and make a profit on top of them. It is quite possible that in the long term, with a more or less high load, the service would cost much more outsourced than in-house. I have spent a lot of time trying to compare all possible options. I had to make a number of assumptions along the way, relying on the existing experience of my company: how many investigations we need to perform during a year, how much time each one usually takes, and what it is going to cost in the end.
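The skeleton of that comparison is simple; in the sketch below every figure is an invented placeholder, and only the shape of the calculation matters.

```python
# Rough sketch of the in-house vs. outsourced cost comparison over several years.
# Every figure is an invented placeholder.
def total_cost(setup: float, yearly_fixed: float, per_case: float,
               cases_per_year: int, years: int) -> float:
    """Up-front spending plus recurring fixed costs and per-investigation charges."""
    return setup + years * (yearly_fixed + per_case * cases_per_year)

CASES_PER_YEAR = 30                   # assumption about the expected investigation volume

for years in (1, 3, 5):
    in_house = total_cost(setup=150_000, yearly_fixed=80_000, per_case=0,
                          cases_per_year=CASES_PER_YEAR, years=years)
    outsourced = total_cost(setup=10_000, yearly_fixed=20_000, per_case=4_000,
                            cases_per_year=CASES_PER_YEAR, years=years)
    print(f"{years} year(s): in-house {in_house:,.0f} vs outsourced {outsourced:,.0f}")
```

With these invented numbers the outsourced option wins in the first year, while the in-house option becomes cheaper over a longer horizon, which is exactly the shift in balance described above.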
A no less important question is what we actually get as a result of our investment: the quality, volume and promptness of critical incident investigations.
A third-party vendor always works within the signed contract. Unlike internal associates, it is not at all interested in the success of our business. This leads to inflexibility in boundary cases: as a rule, everything that is not mentioned in the contract is either impossible or pricey. For instance, we may have a highly critical incident not related to performance, but we are sure that it can be effectively investigated with the same tools and approaches we use in performance cases. An internal APA team will definitely help, but what would an external vendor say? Formally, such a case is far beyond the contract, and we cannot be sure that we will be able to persuade their manager to help us, simply because that manager may not be technical enough to understand all our arguments.
One possible result of an APA investigation can be a recommendation to significantly change the existing infrastructure. This can happen when we want not only to resolve an existing incident, but also to prevent the problem from occurring in the future. A third-party vendor is usually not interested in proactive problem resolution; those problems are its income. Furthermore, such recommendations assume a high level of responsibility, because they may affect the general strategy of infrastructure development. Sometimes, leaving such questions to the judgement of an external vendor is not what we actually want to do.
A no less complicated question is the speed of critical incident investigations. Performance investigation is a complicated area: we can never predict how much time each case will take, as there are too many unknown factors. In any case, this is not something we can write down in a contract. But what do we do when we need a special effort on the root cause analysis of a particular issue? An internal team is interested in our business success, while an external vendor is not.
Effective collaboration with an external vendor assumes a certain formalization of requests and results. Now the question is: is every internal subdivision of IT ready to clearly formulate its demands, and can it understand what exactly the vendor means in its replies? Quite often I receive questions where people are not asking for what they actually need. It turns out that even with outsourcing we need someone within the company who acts as a translator. This person should know the right people within the company and should be ready to talk to the vendor in its language. This role assumes good technical expertise, the same expertise that is required to perform the entire job in-house.
Consideration of all the points listed above led us to the decision to keep APA in-house. However, for each particular business case the proper answer can be different. Any possible argument in this area is questionable, simply because of the nature of the question; I am just sharing our experience.
Conclusion
Everything written above is one version of the practical realization of APA. In our case it gives good results and our management appreciates our work. Over the last year APA was involved in the resolution of many critical issues, and in the majority of cases it played a key role in the investigation process. Moreover, in many cases I can hardly imagine how the findings could have been obtained without APA methods. I hope that you find this material useful. I will appreciate your constructive criticism, ideas and interesting questions.
Truly yours, Daniil Kochetov (daniil.kochetov@effem.com)
References
[2] “Retail IT Service Operation: Calculating the Impacts of Poor Application Performance Across a Business Ecosystem”. Enterprise Management Associates, 2011.