Overcoming Data Analysis Challenges in Complex Large-Scale IoT Environments

In this blog post we will examine the challenges of in-depth data analysis for large IoT deployments and what can be done to overcome these challenges to be able to find important insights through data analysis in this field.

Michael Patkin

May 4, 2022

In September 2020, while analyzing PCR test data, researchers discovered the first major mutation of the SARS-CoV-2 coronavirus (a.k.a. Alpha). The discovery came from the absence of one of the three signals that usually indicate the presence of the virus. Although this tiny anomaly did not affect the overall test result, it had a huge impact on the entire world, as it allowed health officials and policymakers to start preparing for the next outbreak. This example illustrates the importance of diving into the data, well beyond the negative/positive result, to find insights that can help predict small issues before they become big problems. Unfortunately, applying in-depth data analysis to IoT deployments is not as straightforward. In this post, we will examine these challenges and what can be done to overcome them.

Data collection from IoT deployments

In large IoT deployments, data collection and analysis can be used to identify issues at an early stage. These issues can be related to the health of the devices themselves (e.g., communication issues) and/or related to the sensed data (e.g., an IoT device that senses the temperature of a machine).

As far as the health of the IoT devices is concerned, in most cases the monitoring is relatively shallow. Usually, only network activity is monitored, along with general metrics coming from the devices' OS, such as CPU usage and memory load. Everything that happens inside the processes running on the device is not collected, and certainly not analyzed. IoT manufacturers and enterprises do not always know what is important to monitor, and they do not always pay attention to the behavior of third-party libraries, which are responsible for some of the device's most critical operations, such as encryption, OTA updates, and communication stacks. And even if they manage to collect in-depth data, they don't always have the means to process the data coming from millions of IoT devices in order to generate insights from it.

But not all IoT devices are alike in this respect. IoT devices running a Linux OS are generally more robust and can allow data collection. However, IoT devices running an RTOS (Real-Time Operating System) usually present a big challenge in terms of data collection. Without capturing the events' data inside the running functions/processes, it's almost impossible to perform predictive maintenance, and even debugging and solving issues becomes difficult, as there can be no root-cause analysis without this type of visibility.

Overcoming the technical limitations by focusing on what's important

One way to overcome the challenges mentioned above is to collect as much as possible, but when there are limitations, to focus on the few events that can yield valuable insights. For example, as part of the Sternum IoT cybersecurity solution, Sternum uses an approach called Embedded Integrity Verification (EIV), which collects in-depth data from all IoT devices at runtime. Sternum has the ability to look inside any function of any device and gather as much information as possible about that device. But this doesn't mean that the limitations mentioned above do not apply in our case. When data collection is limited, we combine expert knowledge with statistical analysis to understand which data points have yielded important insights, and then we focus our data collection efforts specifically on those data points. By collecting only the most important subset of the data, we make sure that the IoT device's performance is not affected by the collection.

For example, let's say we monitor the usage of three functions – Function A, Function B, and Function C. After a certain period of time, a statistical analysis shows that Functions A and C are always used one after the other. This means that we only need to monitor two functions out of the three: we can exclude either Function A or Function C from the data collection, which helps reduce the amount of data that is collected.
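To make this concrete, here is a minimal sketch of such a redundancy check (not Sternum's actual implementation – the function names and per-window call counts are invented for illustration). It computes pairwise Pearson correlations between the collected usage counts and reports pairs that always move together, so one member of each pair can be dropped from collection:

```python
import numpy as np

def redundant_pairs(usage, threshold=0.99):
    """Find pairs of monitored functions whose call counts are
    (almost) perfectly correlated, so one of each pair can be
    dropped from data collection. `usage` maps function name ->
    list of call counts per collection window."""
    names = list(usage)
    counts = np.array([usage[n] for n in names], dtype=float)
    corr = np.corrcoef(counts)  # pairwise Pearson correlation matrix
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i, j] >= threshold:
                pairs.append((names[i], names[j]))
    return pairs

# Functions A and C are always called together; B varies independently.
usage = {
    "A": [3, 5, 2, 7, 4],
    "B": [1, 9, 4, 2, 6],
    "C": [3, 5, 2, 7, 4],
}
print(redundant_pairs(usage))  # → [('A', 'C')]
```

In practice the same idea extends to any correlated pair of collected metrics, not just function call counts.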

Using AI for data analysis and anomaly detection

After collecting the in-depth data, it has to be analyzed to find anomalies that could indicate an issue. When data is continuously flowing from all devices, it is possible to use AI to spot data patterns that can be identified as anomalies. The AI looks at the data without really understanding it or its context.

Using AI to analyze the time-series data from a single parameter

AI can uncover anomalies that could not have been identified through simple heuristics, even in simple single-parameter data. For example, let's say we want to use CPU or RAM utilization as data points for monitoring the health of a device or identifying a possible cyber-attack. As we can see in Figure 1 below, the CPU utilization was stuck at 100% for a short period of time. Figure 2 displays the RAM utilization on the device. In both graphs the anomalies are visible to the naked eye, yet simple heuristics would find nothing wrong with this data: it may be perfectly normal for CPU utilization to reach its maximum for a short period, or for the average RAM utilization to sit above a certain threshold. AI can identify these strange patterns even in simple datasets like the single-parameter ones displayed below. But where AI really shines is in more complex use cases.

Figure 1. Anomaly in CPU load stuck at 100%
Figure 2. Small motif embedded in an apparently random RAM usage
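To illustrate what the Figure 1 anomaly looks like at the sample level, here is a minimal sketch (the sample values are invented) that flags sustained saturation runs – a temporal pattern that an average-based threshold check would miss, because the mean of the series stays unremarkable:

```python
def saturation_runs(samples, ceiling=100.0, min_len=3):
    """Flag index ranges where a metric stays pinned at its ceiling
    for at least `min_len` consecutive samples -- e.g., CPU stuck
    at 100% -- which a check on the average alone would miss."""
    runs, start = [], None
    for i, v in enumerate(samples):
        if v >= ceiling:
            if start is None:
                start = i  # saturation run begins
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    # close a run that extends to the end of the series
    if start is not None and len(samples) - start >= min_len:
        runs.append((start, len(samples) - 1))
    return runs

cpu = [37, 54, 41, 100, 100, 100, 100, 48, 35, 100, 52]
print(saturation_runs(cpu))  # → [(3, 6)]
```

Note that the brief single-sample spike to 100% at index 9 is not flagged – only the sustained run is. An AI model would discover this kind of temporal structure (and subtler motifs, like the one in Figure 2) without being told in advance what pattern to look for.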

Using AI to find anomalies in multi-parameter data

Another use case that can illustrate how AI can be used to find anomalies in multi-parameter data is an oil pump. The oil pump includes multiple sensors to measure the temperature, pressure, current consumption by an electric motor, etc. The oil pump system strives to maintain the constant flow of oil by using an electrical motor. However, the oil characteristics may change based on weather conditions as follows:

  • When the temperature is low/cold – the viscosity of the oil is greater, requiring more power to pump the same amount of oil.
  • When the temperature is high/warm – the viscosity of the oil is lower, requiring less power to pump the same amount of oil.

So, the multivariable usage pattern is such that when the temperature goes down, the power consumption goes up, as illustrated in Figure 3 below.

During the data analysis, it was found that after a certain period the temperature went down, but the power consumption did not increase accordingly. As seen in Figure 3, which shows the link between the temperature measurement and the motor power, some points corresponding to both low temperature and low motor power are abnormal (highlighted in red). This anomaly could indicate that the oil has been partially replaced by another fluid, such as water, which has very different viscosity characteristics, or that the oil quality has been degraded by mixing it with another substance.
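As an illustrative sketch of this kind of multi-parameter analysis (the readings and the function name are invented, not taken from the actual pump system), one can fit the expected temperature–power relationship and flag samples that deviate from it. Each value in a flagged sample may be normal on its own; it is the combination of low temperature and low power that breaks the learned pattern:

```python
import numpy as np

def residual_outliers(temp, power, z_thresh=2.0):
    """Fit the expected linear temperature->power relationship
    (colder oil -> more motor power), then flag samples whose
    residual from that fit exceeds z_thresh standard deviations."""
    temp, power = np.asarray(temp, float), np.asarray(power, float)
    slope, intercept = np.polyfit(temp, power, 1)  # least-squares line
    residuals = power - (slope * temp + intercept)
    z = residuals / residuals.std()  # least-squares residuals have mean 0
    return np.where(np.abs(z) > z_thresh)[0]

# Invented readings: power tracks temperature inversely,
# except the last sample, which is cold *and* low-power.
temp  = [5, 10, 15, 20, 25, 30, 5]
power = [50, 45, 40, 35, 30, 25, 28]
print(residual_outliers(temp, power))  # → [6]
```

A production system would of course use far more data, more parameters, and a richer model than a straight line; the point is that the anomaly lives in the relationship between parameters, not in either parameter alone.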

Figure 3. Anomaly in the operation of the oil pump

Summary

As we've seen, using AI as part of your monitoring efforts can uncover issues that would otherwise remain undetected. However, AI analysis requires in-depth data collection, which can often be challenging in large IoT environments. These challenges can be addressed by using technologies such as Sternum's EIV, which combines expert knowledge with statistical analysis to focus the data collection efforts. Once the rich data is collected, it is possible to apply AI to discover anomalies in the data patterns, which could indicate that there is an issue. These issues can be identified in simple data (e.g., at the single-parameter level) as well as in more complex multi-parameter datasets.