Hyperic Enterprise Monitoring
My real job is as a Systems Specialist running one of the world’s largest retail websites. It’s a huge operation which requires a knowledge of the whole scope of today’s typical enterprise technologies. The size of the environments and the breadth of scope are what keep me interested and enthusiastic, as I typically get to play with anything and everything. However, from an operations perspective, one big problem is performance monitoring and getting the metrics you need, when you need them. In the past we had a huge bunch of hacks and scripts stacked up like a house of cards, just to get a few measly bits of performance data. I was yearning for a single, supported solution that could give us wide-ranging and ‘live’ performance data on any component of our environments, whilst not creating any kind of performance overhead.
To give you an idea of what we needed to be supported by this solution, here is a very brief example of just the essentials of the technologies we need to monitor for our website:
- AIX and Linux machines, everything from CPU usage information on each of 10 processors in each machine, to network interface throughput and filesystems.
- IBM HTTP Server & Apache stats such as busy workers, number of 404’s, and response times.
- IBM WebSphere Application Server threadpools, jdbc pools and JVM free heap sizes, per application, per server
- Oracle 10g tablespaces, open cursors, table scans etc
The problem typically is that whilst you can find one solution for monitoring one of these things, you then need another solution for another. It was time to find something professional and comprehensive that could monitor it all in one place.
The solution I happened upon was Hyperic HQ. There are two versions of the server, Open Source (yay!) and Enterprise. Its enormous list of supported technologies was impressive, but most important for us was the (rare!) inbuilt support for WebSphere AppServer. As we are really at the stage of assessing how capable the software is, we opted for Open Source to get something up and working as quickly as possible.
The solution works using a simple Server/Agent architecture. The server comes with a self-contained PostgreSQL database (supporting up to 25 platforms) but can be configured to use almost any external database you can think of, such as MySQL or Oracle. The server collates the data sent by the agents and makes it all accessible through its web interface (dashboard), meaning you can create graphs and view metrics on any monitored resource in one place.
Each agent has some magic that (once installed) auto-discovers all the servers and processes on the machine it is installed on, and inventorises lots of nice information such as MAC addresses and primary/secondary DNS servers.
Once the server is installed and you have agents configured, most things are auto-discovered and key metrics start to be collected. However it is worth spending some time configuring what should be collected, and how often, so that you minimise the overhead on the machines themselves, and don’t have lots of useless information slowing down the server/db.
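To get a feel for why the collection intervals matter, here is a rough back-of-the-envelope sketch in Python. The metric counts and intervals are made-up illustrative numbers, not Hyperic defaults, and the function names are my own; the point is only that shortening intervals multiplies the data the server and database have to absorb.

```python
def samples_per_day(interval_minutes):
    """Number of samples one metric produces per day at the given interval."""
    return (24 * 60) // interval_minutes


def daily_data_points(metric_plan):
    """Total data points per day for a list of (metric_count, interval_minutes)."""
    return sum(count * samples_per_day(interval) for count, interval in metric_plan)


# Hypothetical collection plan for ONE machine (illustrative numbers only):
# 20 key metrics every minute, 50 every 5 minutes, 200 every 30 minutes.
plan = [(20, 1), (50, 5), (200, 30)]

per_machine = daily_data_points(plan)
print("data points per machine per day:", per_machine)
print("data points per day for 100 machines:", per_machine * 100)
```

Moving even a handful of metrics from a 1 minute to a 5 minute interval cuts their contribution by a factor of five, which is why it pays to decide up front which metrics really need to be near-live.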
So how well does it work? Well the first couple of days you will find yourself in a daze of nerd-glee whilst you wonder at how much you can collect and how easy it is to produce charts of anything you want over any period of time (if you are talking about a single metric or even a group of identical metrics from different machines). The autodiscovery is excellent and requires very little intervention to start collecting what you need. But then a few days later you will also find yourself increasingly frustrated at how many limitations (artificial in many cases) are enforced that make getting the data out in the way you want it to be presented either difficult and clumsy, or just plain impossible.
Sure I can drill down in fantastic detail on any single metric and get very specific information on (for example) the Free Heap Size of one particular application server’s JVM. I can even group a bunch of Appservers and see all the Free Heaps plotted on the same graph. But surprisingly what HQ fails to recognise in the way it presents the data is that often different metrics are related to each other, and it makes life very difficult to get real ’cause and effect’ information on a single page. For example, if we see the threadpool maxed out, I need several clicks and an ingrained knowledge of the interface to find my way to the JDBC usage page, where I can see that the database not responding in a timely fashion is what caused the pileup in the threadpool.
My biggest gripe (and slightly related to the above point) from an operations perspective is the lack of a real overview mode (there is a community plugin called wallmount, but it is super-simplistic and very buggy showing nothing more than buttons coloured for availability). What operations centre doesn’t have the giant monitor on the wall showing key performance metrics from all the production machines? There is no such mode in Hyperic, and I have spent many hours already trying to build something that would do such a thing. In our example we need to see, on one screen, live graphs for JVM Usages for all appservers, busy workers for all web servers, threadpool usage for all appservers, JDBC pools to the databases and probably some simpler indicators for things such as network status. Each of these things I can see in fantastic detail individually.
The most annoying part is that this information is all there and collected, there is just no way to present all (or even two of) these things on one, autorefreshing screen. For its intended market this is a glaring omission which really reduces the usefulness of all that beautiful metric data. I am trying to make a butt-ugly 8 frame page with a graph in each frame to get us what we want, but the API doesn’t make it clear how I should get a time value of ‘now’ in the URLs nor how to set the scale to 4 hours for example (not to mention I would have to log an admin user into each frame every time the machine was restarted).
Hyperic may argue that each user can be given an account, log in and manually monitor anything they want. You can add favourite charts and resources to your dashboard for quick access but that brings me onto the next point, the limitations of the Open Source version that are artificially enforced to try and get you to upgrade to the paid Enterprise Version. There are no user-roles in the open source version meaning everyone is an administrator. I don’t want to give administrator privileges to the guys in the control room so they are able to delete servers, remove users, create their own hodgepodge of resource groups etc. Incidentally there is no LDAP authentication meaning our security model is broken by Hyperic HQ. Amongst the other limitations are no reporting, the inability to set baselines for the charts, when adding a favourite graph to the dashboard only a link is shown instead of a live graph as in the enterprise version, etc etc.
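For what it is worth, the shape of the 8 frame hack looks roughly like the Python sketch below. Everything Hyperic-specific in it is an assumption: the chart URL template, its `resource`, `begin` and `end` parameters and the host name are placeholders I made up, since I have not found the real chart URL format documented. The only parts I am sure of are the generic ones: a 4 hour window ending ‘now’ expressed in epoch milliseconds, and a page of frames that reloads itself with a meta refresh.

```python
import time
from html import escape


def time_window_ms(hours, now_s=None):
    """Return (start, end) in epoch milliseconds for the last `hours` hours."""
    end_s = time.time() if now_s is None else now_s
    end_ms = int(end_s * 1000)
    start_ms = end_ms - int(hours * 3600 * 1000)
    return start_ms, end_ms


def build_wall_page(chart_urls, refresh_seconds=60, columns=4):
    """Build one auto-refreshing HTML page with an iframe per chart URL."""
    width = 100 // columns - 1  # leave a little room between frames
    frames = "\n".join(
        '<iframe src="{}" style="width:{}%;height:45vh;border:0"></iframe>'.format(
            escape(url, quote=True), width
        )
        for url in chart_urls
    )
    return (
        "<!DOCTYPE html>\n<html><head>\n"
        '<meta http-equiv="refresh" content="{}">\n'
        "<title>Operations wall</title>\n"
        "</head><body>\n{}\n</body></html>\n"
    ).format(refresh_seconds, frames)


# HYPOTHETICAL URL template: host, path and parameter names are placeholders,
# not the real Hyperic chart URL format.
CHART_URL = "https://hq.example.invalid/chart?resource={rid}&begin={begin}&end={end}"

start, end = time_window_ms(4)
urls = [CHART_URL.format(rid=rid, begin=start, end=end) for rid in range(1, 9)]
page = build_wall_page(urls)

with open("wall.html", "w", encoding="utf-8") as fh:
    fh.write(page)
```

Because the time window is baked into the generated file, the script would have to be re-run on a schedule (cron, for example) for ‘now’ to keep moving, and it does nothing about the login problem in each frame.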
This is in my opinion the totally wrong way to go about the business model. If you are going to have an open source business model it should be along the lines of a fully functional open source version of the software making money from paid support services like Red Hat et al. As a key piece of software in a production environment we would always pay for a support contract quite happily. I hope this is a model they move to in the near future. I would consider using the trial version of the enterprise software to assess whether the benefits are worth the money, but there is no way to downgrade to Open Source version and vice versa meaning we lose all collected data in the process. There is a lot of talk about open source, community development etc on the Hyperic webpage, but it seems like quite a stale and quiet community with very little in the way of community development, no doubt discouraged by the business model and lack of Hyperic presence on the forums.
So overall, what are my thoughts? On the data collection, inventorisation, and auto-discovery side of things – excellent, really truly superb. There are some little things I would like to see changed, such as the addition of web stats, and the fact that appserver monitoring uses the default threadpool (whereas we use the webcontainer threadpool, for instance), which is a niggle I have to work around by creating a custom service. But they are small details in the grand scheme of things which I can customise when I get the time.
On the presentation of data, Hyperic HQ excels in providing data on single metrics, but is appalling in its configurability to get varied metrics displayed in a convenient and customisable way. The lack of the Wallmount/Overview/Cockpit/whatever you want to call it mode is really inexcusable as I have already mentioned (several times!). The last point is that the Open Source version is limited and encourages nothing in the way of community development when so many key features are artificially excluded.
I realise my tone is rather harsh, but in my opinion a product is either open source or not. You cannot claim to be open source and try and get community development enthused with a model such as Hyperic’s. Would I recommend Hyperic HQ? Yes absolutely if you have an environment without clusters, interdependencies and instead simple single server applications with a limited monitoring audience. As it stands, my search is continuing for a solution that (open source or not) can provide a better picture of a large and high volume operations environment.