Smart Data Visualization: Helping Decision Makers Get the Picture

Smart data visualization is proving to be an essential tool in maintaining increasingly complex Big Data systems in the cloud.

The adoption of Big Data tools and technology relies heavily on distributed, scaled-out computing. What makes this setting different is that the system operates as a whole on top of several independent hosts, which coordinate their actions with limited information; as a result, maintenance complexity increases significantly. One way to meet this challenge is smart data visualization, which helps IT experts and management pinpoint the source of a problem quickly.

The need for smart visualization is not unique to this problem. Representing complex data as a concise picture that tells decision-makers a story is a key part of any data analytics or data science project. Valuable results of a rigorous analysis may go unnoticed if no visualization clearly communicates the underlying information to the reader. The importance of data visualization is nothing new: visualization tools, and general interest in data visualization, have exploded in popularity in recent years, as evidenced by the proliferation of literature about infographics and visualization techniques in both print and online media.

Executive customers of the Data Science-as-a-Service (DSaaS) team can’t review every detail in the data they use. To make data-driven decisions and draw conclusions, they need a distilled version of the data. This is where smart visualization matters most: it allows readers to understand “what is going on” in the data in just a few moments instead of undertaking a tedious, time-consuming analysis.

Nowadays, advanced data visualizations go beyond graphs and charts to support crucial business decisions. Several visualization formats are available: static, zooming, clickable, animated, video, or interactive. The choice among them depends on the overall objectives of the visualization. While static is the simplest and most common form, interactive options are becoming more popular because they give users some control over the displayed information.

Many interesting visualizations can be found on the Visualizing.org portal, which can provide you with ideas for new visualization forms. One example is an interactive dashboard showing global and local trends in human migration over the last 50 years:

DataSci_Image_1

In just a few seconds this dashboard reveals some interesting patterns, such as the migration “boom” of the 1980s, the subsequent growth of migration worldwide, and how armed conflicts produce large population movements in specific periods.

Another interesting visualization is Facebook’s friendship locality map where the lines represent real human relationships:

Data_Sci_2

Beyond the fact that continents and certain international borders are visible on this map, a reader can observe the level of Facebook popularity and the connectivity between people in different areas of the world. Areas of low or no activity also become immediately apparent and can be targeted with stronger marketing efforts. This is a nice example of a visualization highlighting specific relations within the data. In Facebook’s case these are relations between social network users; in other cases they may be relations between devices, event types, or services.

The above visualizations are examples of efficient ways to communicate data. Successful data visualizations are space efficient and display all the data within a single field of view, allowing a reader to see the entire picture with minimal eye movement and without scrolling or flipping between pages. In this post we show how a very simple data analysis, along with a creative visualization, can help assess a server’s status in a few seconds and save several days of work.

An important task for field support specialists at EMC, a customer-facing organization, is Root Cause Analysis (RCA) of problems in a customer’s installation. RCA usually requires looking at records and manually correlating events from multiple log files. Although log files are commonly used for various support purposes, they are often very large and take a textual form that is difficult to follow. For example, below is a screen capture of part of a log file:

Data_Sci_3

There are tens of thousands of entries in a log file, and hundreds of such files are aggregated for each distinct machine or system. In scaled-out systems, where a single device comprises several execution nodes, an additional layer of complexity is added. In this setting, manually digging through heaps of logs becomes a difficult, time-consuming task that may take several days to complete. Although a few tools for analyzing log files exist, none are suited to the specific RCA task that EMC field support specialists face daily.

Below we show how a quick, 30-minute Python script can reduce the time and effort a specialist invests in analyzing multiple log files from several days to a few minutes or seconds. As an example, consider the analysis of log files from one of EMC’s systems: in a recent engagement, we experimented with a set of log files originating from a storage system with multiple nodes.

The purpose was to provide an easy-to-use support system for correlating multiple events over time, so that a field support engineer can quickly identify events of interest that may have triggered a given problem. Note that this differs from server health monitoring, where continuous and detailed analysis of numerous parameters is needed. Here, by contrast, the main interest is a summary view over a long time period and the relations between multiple events.

After some data cleaning and processing, we can derive a list of the events occurring in the system over time (see the detailed explanation of the preprocessing at the bottom of this post) and use this data to power an exploratory visualization of events. The events timeline was delivered as an HTML file with several interactive capabilities (e.g. zoom, resize, hover), as shown in the screenshot below.

Data_Sci_4

The chart shows a count of events over a period of almost two months, taken from a set of predefined log files residing on each node of the system. Larger circles represent a higher count (circle sizes are normalized within each log file). The visualization reveals a system-wide (“global”) event, clearly visible on the right-hand side of the figure, towards the end of the monitoring period. A field support specialist can use the tool to zoom in on areas of interest for further analysis within a few seconds, or get additional information about events by hovering over the circles.

As mentioned above, this is just one example of how to extend the basic capabilities of log analysis tools. The general idea is to leverage an existing distributed file system such as HDFS and build a monitoring tool that processes and analyzes the log data in parallel using MapReduce. The results can then be published as an easy-to-use, web-based interface for drilling down into the system and examining its overall health. Among other things, we envision features such as a visual timeline of events, log content analysis, and quick data access by zooming in on events of interest. For applications where real-time log processing is valuable, higher-performance tools such as GemFire can be leveraged.
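To make the MapReduce idea a bit more concrete, here is a minimal Hadoop Streaming style sketch in Python. It is not part of the original tool: it assumes a hypothetical log format whose first two whitespace-separated fields are a date and a time such as “2014-01-10 12:03:17”, and it counts events per five-minute bucket, mirroring the aggregation performed by the parsing script further down.

[sourcecode language="python" wraplines="false" collapse="true"]
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch: count log events per 5-minute bucket.
# Assumes (hypothetically) that each log line starts with "YYYY-MM-DD HH:MM:SS".
import sys

def mapper():
    # Emit "<5-minute bucket>\t1" for every log line read from stdin.
    for line in sys.stdin:
        parts = line.split()
        if len(parts) < 2:
            continue
        date, time = parts[0], parts[1]
        hh, mm = time[:2], time[3:5]
        bucket_minute = int(mm) // 5 * 5
        print('%s %s:%02d\t1' % (date, hh, bucket_minute))

def reducer():
    # Sum the counts per bucket; Hadoop delivers keys already sorted.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t')
        if key != current_key and current_key is not None:
            print('%s\t%d' % (current_key, count))
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        print('%s\t%d' % (current_key, count))

if __name__ == '__main__':
    # Invoked via the Hadoop Streaming -mapper / -reducer options, e.g.
    # "python streaming_counts.py map" and "python streaming_counts.py reduce".
    mapper() if sys.argv[1:] == ['map'] else reducer()
[/sourcecode]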

For interested readers, we provide a Python code sample that can be used for basic log file parsing. As the snippets show, the most interesting and challenging part was the visualization: fitting hundreds of thousands of events and a two-month timeline, at five-minute intervals, onto a single screen so that it can be easily and clearly interpreted by a human, while the actual parsing and analytical calculations take only a few lines of Python.

[sourcecode language="python" wraplines="false" collapse="true"]
import os
import pandas as pd
import csv

'''
Script example for processing log files of specified event types.
Assumes event log files are stored under the 'Logs' directory,
in folders named after the corresponding event types.
'''

#### Function for extracting all timestamp entries from a log file ####
def extractEventTimeStamps(filePath):
    # Read file
    lines = list(csv.reader(open(filePath)))
    # Filter out empty entries
    lines_filtered = [line for line in lines if len(line) > 0]
    # Create a list of all timestamps
    ts_list = [line[0].split(' ')[0] for line in lines_filtered]
    return ts_list

#### Function for processing all log files in a folder: ####
#### read & parse content, compute required statistics ####
def processLogs(folderPath):
    infos = []
    # List all files in the directory
    for root, dirs, files in os.walk(folderPath):
        for f in files:
            infos.append((root, os.path.join(root, f)))
    # Parse and store timestamps from the files
    all_ts = []
    for dname, fname in infos:
        ts_list = extractEventTimeStamps(fname)
        all_ts.extend(ts_list)
    # Convert timestamps to datetime format
    ts_formated = [pd.to_datetime(ts_str) for ts_str in all_ts]
    # Create a time series with value "1" for each timestamp entry
    ts = pd.Series([1] * len(ts_formated), index=ts_formated)
    # Aggregate and count the number of entries in 5-minute buckets
    counts = ts.resample('5min', how='sum')
    return counts

#### Read & Process Log Data ####
eventTypes = ['service1', 'service2', 'service3', 'service4', 'service5', 'service6', 'service7']
logsDir = os.path.join(os.getcwd(), 'Logs')
# Create an empty DataFrame for storing the results. It will be used as input for the plotting script.
eventsDF = pd.DataFrame()
# For each event type
for eventType in eventTypes:
    eventLogsFolder = os.path.join(logsDir, eventType)
    # Process the log files
    eventCounts = processLogs(eventLogsFolder)
    # Add the results as a new column to the data frame
    eventsDF[eventType] = eventCounts
[/sourcecode]
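Before plotting, it can be useful to sanity-check the aggregated counts. The short snippet below is an illustrative addition, not part of the original script; it only uses standard pandas calls on the eventsDF DataFrame built above.

[sourcecode language="python" wraplines="false" collapse="true"]
# Quick sanity check of the aggregated counts before plotting (illustrative only).
print(eventsDF.shape)   # (number of 5-minute buckets, number of event types)
print(eventsDF.head())  # first few rows of counts per event type
print(eventsDF.sum())   # total number of events per event type
# Buckets with no events for a given type appear as NaN;
# the plotting script below converts them to zeros with np.nan_to_num.
[/sourcecode]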

For the events timeline visualization we used Bokeh, an interactive web plotting library for Python.

[sourcecode language="python" wraplines="false" collapse="true"]
import pandas as pd
import numpy as np
from bokeh.objects import HoverTool
from collections import OrderedDict
from bokeh.plotting import *
from bokeh.objects import Range1d
from bokeh.palettes import brewer
from datetime import timedelta

# Set the dates of interest for the visualization
fromDate = pd.datetime(2014, 1, 10, 0, 0, 0)
toDate = pd.datetime(2014, 3, 10, 0, 0, 0)

# eventsDF - events count data as created by the 'parser.py' script
df1 = eventsDF.ix[fromDate:toDate]

# Output static HTML file
output_file("ServiceEventsViz.html", title="Service Events Visualization")
# Add renderers to the current existing plot
hold()
# Create a figure
figure()

# Set up plot parameters
labels = list(df1.columns.values)  # set labels to the eventType names
event_dates = [time for time in df1.index]
event_dates_str = [t.strftime('%Y-%m-%d %H:%M') for t in event_dates]
# Set up a plot range slightly wider than the analysis range
plot_range = [np.datetime64(x).astype(long) / 1000 for x in [fromDate - timedelta(days=2), toDate + timedelta(days=2)]]
y_categories = [str(i) for i in range(len(df1.columns))]
# Define the color palette for event types
colormap = brewer["Spectral"][len(y_categories)]

# For each event type
for i in range(len(df1.columns)):
    # Get values
    values = df1[df1.columns[i]].values
    values = np.nan_to_num(values)
    # Normalize the values
    max_val = max(values)
    nVals = [v / max_val * 40 for v in values]
    # Set y-axis values for the circles
    y = [str(i)] * len(event_dates)
    # Put the data into a ColumnDataSource; it is also used by the hover tool
    source = ColumnDataSource(
        data=dict(
            ti=event_dates_str,
            log=[labels[i]] * len(event_dates_str),
            c=values,
        )
    )
    # Render circles for the event data points - circle size represents the normalized number of events
    circle(event_dates, y, size=nVals, source=source, color=colormap[i], alpha=0.5, line_color='black',
           x_axis_type="datetime",
           x_range=Range1d(start=plot_range[0], end=plot_range[1]),
           y_range=y_categories,
           tools="pan,wheel_zoom,box_zoom,reset,previewsave,resize,hover",
           plot_width=1200, plot_height=600)

# Add labels with event type names along the y-axis
text([pd.to_datetime(event_dates_str[-1]) + timedelta(days=1.5)] * len(y_categories), y_categories, text=labels, angle=0, text_font_size="10pt", text_align="left", text_baseline="middle")

# Layout customization
plot = curplot()
plot.title = "Service Events"
grid().grid_line_alpha = 0.3
axis()[1].major_label_text_font_size = "0pt"
axis()[1].axis_line_color = None
axis()[1].major_tick_line_color = None

# Add some info for the hover tool
hover = [t for t in curplot().tools if isinstance(t, HoverTool)][0]
hover.tooltips = OrderedDict([
    ('date & time', '@ti'),
    ('file name', '@log'),
    ('num. of records', '@c'),
])

# Open a browser
show()
[/sourcecode]
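If the parsing and plotting snippets are kept as separate scripts, as the ‘parser.py’ comment above suggests, one simple way to hand the aggregated counts from one to the other is a CSV round trip. The file name below is purely an illustrative choice, not part of the original code:

[sourcecode language="python" wraplines="false" collapse="true"]
# At the end of the parsing script: persist the aggregated counts (illustrative file name).
eventsDF.to_csv('event_counts.csv')

# At the top of the plotting script: reload them with the timestamps as the index.
import pandas as pd
eventsDF = pd.read_csv('event_counts.csv', index_col=0, parse_dates=True)
[/sourcecode]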

To learn more about EMC IT Data Science efforts, read previous blogs from our data scientists.

About the Author: Lena Tenenboim-Chekina