Author: Norman Scaife (WALLIX)
Everyone who uses the internet has encountered Web Analytics. It has many incarnations, but on the internet it usually refers to the gathering and analysis of data about access patterns to Websites. This can be done for various reasons: for performance analysis, for example to improve response times to Website accesses; for commercial reasons, such as testing the effectiveness of an advertising campaign; or for market research, for example tracking popularity trends for online products or services. Given the near-ubiquity of Web Analytics on the internet, it represents a colossal financial investment by many companies, large and small. It is difficult to get recent information about the economics of Web Analytics, but as an indication of the financial scale of the industry, note that Google Analytics Premium (the paid equivalent of Google Analytics, the largest player in the field) is only recommended for Websites earning more than 1M USD per annum.
Implicit in all the activities associated with Web Analytics are concerns about the privacy of users' actions and data, which could be used maliciously if they fall into the wrong hands or are misused by whoever collects them. There is, in fact, controversy about the unrestricted deployment of such data-gathering operations, and this has led to a range of countermeasures which can circumvent the data collection process, such as the NoScript, Disconnect, Ghostery, AdGuard and many other browser plugins which can simply block all accesses to Web Analytics sites. While this may appease privacy advocates, it also prevents the good work that Web Analytics, approached correctly, can do to actually make Websites better.
At WALLIX we are developing a method of performing Analytics in such a way that the owners of the data being analysed can be guaranteed that their original data will remain secure. The simplest semantics for these guarantees is that the original data never leave the clients' systems in unencrypted form. This is a strong enough guarantee for most applications, provided the encryption used is itself sufficiently strong.
Our use case is on a smaller scale than the likes of Google Analytics. We have developed an Open Source command-line interface for configuring Amazon AWS servers, which we call Awless. During development of this tool we attempted to improve its performance by analysing access patterns from our clients to the Amazon AWS Website. Unfortunately, this required unencrypted access to our clients' data, which was not authorized for obvious reasons: Amazon AWS accounts can represent a significant monetary investment and can contain personally identifiable data such as usernames and emails.
This is a situation in which Functional Encryption (FE) is a good match. The idea is that we ask each of our clients to encrypt their data and then send us the encrypted data plus a functional decryption key. Decryption then provides us with the value of a function applied to the clients' data, without us ever having access to the original data. Of the FE protocols currently available, there are several which can give us access to the sum (and therefore the average) of the clients' data. For this we initially chose a Distributed Multi-Client FE (DMCFE) scheme which computes an inner product of the client data vector and a weighting vector. In statistical terms this maps to a weighted average over the clients' data.
We deploy the FENTEC FE library, which has a Golang implementation of the DMCFE protocol (Awless itself is written in Golang). This protocol suits our use case very well because it allows the encryption keys to be generated on the clients' own systems. Integration into our application proved very simple, since for the DMCFE protocol it is quite obvious where each function of the protocol should be placed.
The major technical difficulty to be overcome is the data redistribution phase implied by the shared key generation. For this we were obliged to implement a common server for the clients, used as a data-gathering point. This represents a potential security weakness in our design, but it is essential since we cannot allow our clients to interact directly. With careful implementation, however, we believe we have a design which is both efficient and secure.
Note that we have to be careful with any FE implementation, since there are pitfalls over and above the usual problems of traditional encryption, such as securing the access keys. As a trivial example, consider that during an analytics run, if only one client responds, then the FE result is actually that client's data. More subtle attacks are possible, such as differencing the results of overlapping polls, but our implementation has been checked for such weaknesses and found to be adequately secure.
In terms of functionality, we have achieved our goals for the WALLIX use case. We are now able to perform the safe Analytics we set out to perform at the start of the project. We do not quite have total coverage of our original data-analysis requirements; for example, we were planning to measure the maximum of the maximum memory-usage counts across our clients' machines. That cannot be done safely, since the FE result would then be equal to the value of one client's data, even if we are unable to say which client those data belong to.
One problem with multi-input or multi-client FE schemes is that, in order to decrypt and produce the result, the key values from all the clients are necessary. This can make the analytics process rather fragile: if a client crashes, or maliciously decides not to return its public key, then the whole protocol fails and the result is inaccessible. From our experience demonstrating the process, we conclude that a greater degree of robustness is required in practice. We achieve this by simple partitioning: we divide our clients into disjoint groups and perform a full protocol cycle on each group. This gives us a degree of robustness, since a missing key now means that we lose only one partition. As a side benefit, it opens up the possibility of performance improvements by running the partitions on distinct servers, and we can also apply outlier detection across the partitions.
The one aspect of the use case we have not yet determined is scalability. For our own purposes, up to about 200 clients, the performance of the method is actually very good, but we would like to make statements about the applicability of the technique in a more general setting. For that we need to estimate the asymptotic behaviour of the technique at much higher numbers of clients. That will be our focus in the final stages of the project.