As an example, we developed LogSense to troubleshoot non-trivial multi-server issues at the Fortune-1 company.
Log analytics is not new, but two things were not solved problems at the time: handling a specific type of
hierarchical log that traces a request through multiple internal servers, and making sense of those logs in a
high-traffic environment. LogSense was used extensively at the Fortune-1 company to troubleshoot issues,
including bugs, performance bottlenecks, and security events.
Problems that used to take hours, days, or weeks to figure out took about half an hour to resolve.
You can use it in scenarios such as the following:
- when you release new code;
- when new capacity is added;
- when unusually high traffic is expected;
- when a serious problem is happening on your website;
- when fraud is suspected;
- during performance testing -- you will identify the root cause of issues quickly;
- during QA testing -- your QA team will identify issues quickly and communicate them more efficiently to your development team;
- during development -- it will help software engineers debug their systems, including complex interactions with third-party systems.
Examples of what you can do with LogSense:
- Thread related calls.
Inspect the log for a given user session across multiple requests, including what happened at various backend servers.
Inspect the details of a specific user request, including database calls, search calls, and other backend calls, ordered in the correct logical sequence regardless of the order in which they appeared in the logs. Connect logs from different backend and frontend servers using associative mapping.
Essentially, handle hierarchies in your logs.
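To make the idea of log hierarchies concrete, here is a minimal Python sketch of how records from several servers could be put back into logical order. It is not LogSense's implementation. The record fields (`call_id`, `parent_id`, `seq`, `msg`) and the function name `order_request_logs` are assumptions made for illustration; your logs will carry their own correlation fields.

```python
from collections import defaultdict


def order_request_logs(records):
    """Return (depth, record) pairs for one request in depth-first order.

    Assumed (hypothetical) record keys: 'call_id', 'parent_id' (None for
    the root call), 'seq' (order among sibling calls), and 'msg'.
    """
    children = defaultdict(list)
    for rec in records:
        children[rec["parent_id"]].append(rec)
    for siblings in children.values():
        siblings.sort(key=lambda r: r["seq"])

    ordered = []

    def visit(parent_id, depth):
        # Emit each child, then everything that child called.
        for rec in children.get(parent_id, []):
            ordered.append((depth, rec))
            visit(rec["call_id"], depth + 1)

    visit(None, 0)
    return ordered


# Records as they might arrive from different servers, out of order.
records = [
    {"call_id": "c", "parent_id": "b", "seq": 1, "msg": "db: SELECT cart"},
    {"call_id": "a", "parent_id": None, "seq": 1, "msg": "frontend: GET /cart"},
    {"call_id": "d", "parent_id": "a", "seq": 2, "msg": "search: recommendations"},
    {"call_id": "b", "parent_id": "a", "seq": 1, "msg": "cart-service: load cart"},
]

for depth, rec in order_request_logs(records):
    print("  " * depth + rec["msg"])
```

The frontend call is printed first, followed by the cart-service call with its database call indented beneath it, and then the search call, regardless of the order in which the records arrived.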
- Exceptions
Find all exceptions in your live server logs in the past 10 minutes.
Constrain them by log source, data center, specific set of machines, etc.
Skip those that you already know about or that are not of current interest, and save these exclusions as a preference for future searches.
If a NullPointerException (or some other exception) is caused by different things, find all unique issues with one click.
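One common way to separate unique issues behind the same exception type is to group events by a fingerprint of the top stack frames. The sketch below illustrates that general technique and is not a description of how LogSense does it. The event layout (`type`, `frames`), the choice to ignore line numbers, and the default depth of three frames are all assumptions.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(exc_type, frames, depth=3):
    """Hash the exception type plus the top `depth` frames.

    Line numbers are stripped (an assumption) so that small code shifts
    do not split one issue into several groups.
    """
    top = [re.sub(r":\d+", "", frame) for frame in frames[:depth]]
    key = exc_type + "|" + "|".join(top)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def unique_issues(events):
    """Group exception events (hypothetical dicts) by fingerprint."""
    groups = defaultdict(list)
    for ev in events:
        groups[fingerprint(ev["type"], ev["frames"])].append(ev)
    return groups


events = [
    {"type": "NullPointerException",
     "frames": ["CartService.load(CartService.java:42)",
                "CartController.get(CartController.java:17)"]},
    {"type": "NullPointerException",
     "frames": ["CartService.load(CartService.java:45)",
                "CartController.get(CartController.java:17)"]},
    {"type": "NullPointerException",
     "frames": ["PriceCache.lookup(PriceCache.java:88)",
                "CartService.total(CartService.java:60)"]},
]

for fp, group in unique_issues(events).items():
    print(fp, len(group), group[0]["frames"][0])
```

Here the three NullPointerException events collapse into two groups: one for the failures in the hypothetical `CartService.load` and one for the failure in `PriceCache.lookup`.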
- High Benches
Inspect the histogram of time taken by various end-to-end calls or by the subsystems within them -- e.g., in 100 ms buckets.
Drill down into this histogram for the higher benches and inspect the corresponding logs.
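The bucketing and drill-down described here can be sketched in a few lines of Python. This is an illustration under assumed names: the `duration_ms` field and the functions `latency_histogram` and `drill_down` are hypothetical, not part of LogSense.

```python
from collections import Counter


def latency_histogram(records, bucket_ms=100):
    """Count calls per bucket, keyed by the bucket's lower bound in ms."""
    counts = Counter(
        (int(r["duration_ms"]) // bucket_ms) * bucket_ms for r in records
    )
    return dict(sorted(counts.items()))


def drill_down(records, lower_ms, bucket_ms=100):
    """Return the records that fall into the bucket starting at lower_ms."""
    return [
        r for r in records
        if lower_ms <= r["duration_ms"] < lower_ms + bucket_ms
    ]


# Hypothetical end-to-end call timings.
calls = [
    {"url": "/home", "duration_ms": 35},
    {"url": "/home", "duration_ms": 80},
    {"url": "/cart", "duration_ms": 120},
    {"url": "/cart", "duration_ms": 145},
    {"url": "/search", "duration_ms": 199},
    {"url": "/search", "duration_ms": 310},
    {"url": "/checkout", "duration_ms": 990},
]

print(latency_histogram(calls))   # {0: 2, 100: 3, 300: 1, 900: 1}
print(drill_down(calls, 900))     # the single slow /checkout call
```

Keying each bucket by its lower bound keeps the drill-down simple: the key you pick in the histogram is the same value you pass to the filter.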
- Inspect specific subsystems
Find all calls to a caching layer or a data layer. Constrain the results to the slower end-to-end calls. Drill down for details.
- Inspect potential fraud
If a customer is logged in through concurrent user sessions from different browser types, are they trying to game your system? Have they found a vulnerability in your caching code, perhaps? You can zoom into these sessions, with data from your various backend servers, with just a few clicks.
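As a rough sketch of the check described here, the Python below flags customers who have overlapping sessions from different browsers. The session fields (`session_id`, `customer_id`, `browser`, `start`, `end`, with times in seconds) and the function name are assumptions for illustration. A real check would likely need more signals than browser type alone.

```python
from collections import defaultdict
from itertools import combinations


def concurrent_mixed_browser_sessions(sessions):
    """Map customer_id to pairs of overlapping sessions on different browsers."""
    by_customer = defaultdict(list)
    for s in sessions:
        by_customer[s["customer_id"]].append(s)

    flagged = {}
    for customer_id, group in by_customer.items():
        pairs = [
            (a["session_id"], b["session_id"])
            for a, b in combinations(group, 2)
            if a["browser"] != b["browser"]
            # Two intervals overlap when each starts before the other ends.
            and a["start"] < b["end"] and b["start"] < a["end"]
        ]
        if pairs:
            flagged[customer_id] = pairs
    return flagged


# Hypothetical session records.
sessions = [
    {"session_id": "s1", "customer_id": "alice", "browser": "Firefox", "start": 0, "end": 600},
    {"session_id": "s2", "customer_id": "alice", "browser": "Chrome", "start": 300, "end": 900},
    {"session_id": "s3", "customer_id": "bob", "browser": "Chrome", "start": 0, "end": 100},
    {"session_id": "s4", "customer_id": "bob", "browser": "Safari", "start": 200, "end": 300},
]

print(concurrent_mixed_browser_sessions(sessions))  # {'alice': [('s1', 's2')]}
```

Only the first customer is flagged, because the second customer's two sessions do not overlap in time.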
- Compare performance over time as your code evolves
Generate statistical data on the number of SQL calls for each URL. Compare these over time as your code evolves so you know if and when additional SQL calls are introduced.
Compare different server instances. Apply machine learning to automatically identify anomalies, and inspect them further with search. For example, a patch may have been applied to a subset of servers, or a license may have expired.
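The per-URL SQL comparison described in this item can be sketched as follows. This is only an illustration, assuming each request record carries a `url` and a `sql_calls` count; those field names, the function names, and the `tolerance` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def sql_calls_per_url(requests):
    """Average number of SQL calls per request, keyed by URL."""
    counts = defaultdict(list)
    for req in requests:
        counts[req["url"]].append(req["sql_calls"])
    return {url: mean(values) for url, values in counts.items()}


def sql_call_increases(before, after, tolerance=0.0):
    """URLs whose average SQL call count grew by more than `tolerance`."""
    return {
        url: (before[url], after[url])
        for url in before.keys() & after.keys()
        if after[url] - before[url] > tolerance
    }


# Hypothetical request logs from two releases.
release_a = [
    {"url": "/cart", "sql_calls": 3},
    {"url": "/cart", "sql_calls": 3},
    {"url": "/search", "sql_calls": 5},
]
release_b = [
    {"url": "/cart", "sql_calls": 4},
    {"url": "/cart", "sql_calls": 6},
    {"url": "/search", "sql_calls": 5},
]

increases = sql_call_increases(
    sql_calls_per_url(release_a), sql_calls_per_url(release_b)
)
print(increases)  # /cart went from an average of 3 to 5 SQL calls
```

Only URLs present in both releases are compared, so newly added or removed pages need a separate report.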