Citavi Cloud Connection Issues in November 2019
If you were working with a cloud project in November 2019, you may have noticed multiple error messages informing you that there was no connection to the cloud. However, the monitoring page http://status.citavi.com showed that all systems were running smoothly. Below we explain what caused this problem and what we have done to address it.
Information about Citavi's Architecture
Beginning with Citavi 6, projects can be saved in three locations: on the user's local drive, on a SQL server (for companies and institutions), or in the Citavi Cloud.
In all three cases, it is possible for multiple users to work on a project at the same time. This is easiest with the cloud, since it doesn't require users to be connected to the same network drive or database server.
Allowing multiple users to work simultaneously on the same project is a fairly complex technical task. The Citavi desktop client (the program that runs on your Windows computer) has to constantly receive any changes other users make to the project and synchronize them with the changes you are making.
This process is particularly demanding when using the cloud. If all the Citavi clients were constantly bombarding the servers with requests for changes, it would overload the servers. To prevent this, the server maintains a persistent connection with each individual client. If User B makes a change to a project, the server uses this existing connection to notify User A's Citavi client, which then retrieves the most current data and integrates it appropriately.
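As an illustration of this push model, here is a minimal Python sketch. The names and structure are hypothetical and chosen for clarity; this is not Citavi's actual server code. Instead of every client polling for changes, the server keeps one persistent connection per client and notifies everyone else in the project when a change arrives.

```python
# Minimal sketch of a push model (illustrative names, not Citavi's actual
# server code): the hub keeps one persistent connection per client and
# pushes a change notification to every other connected client.

class Hub:
    def __init__(self):
        # user_id -> callback standing in for a persistent connection
        self.connections = {}

    def connect(self, user_id, on_change):
        self.connections[user_id] = on_change

    def publish_change(self, author_id, change):
        # Push the change over each existing connection except the author's.
        for user_id, notify in self.connections.items():
            if user_id != author_id:
                notify(change)


received = []
hub = Hub()
hub.connect("user_a", lambda change: received.append(("user_a", change)))
hub.connect("user_b", lambda change: received.append(("user_b", change)))

# User B edits the project; only User A is notified.
hub.publish_change("user_b", "title updated")
```

The point of the design is that notifications travel only when something actually changes, so idle clients generate no request traffic at all.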
A Necessary Update
When we developed the server code for the Citavi Cloud, ASP.NET was the latest technology. The Microsoft technology for building established server-client connections was called "ASP.NET SignalR."
Since then, Microsoft has developed its server technology further. "ASP.NET Core" is now state-of-the-art. We changed our servers over to this new technology in November. Along with ASP.NET Core, a new version of SignalR was released: "ASP.NET Core SignalR."
The Need for Backwards Compatibility
Unfortunately, the "Core" version of SignalR is not backwards-compatible with the older version. For existing Citavi clients (versions 6.0 to 6.3) to continue to function, we had to develop a "SignalR redirector," which preserves compatibility with these clients.
This redirector caused massive problems in November. For users, these problems appeared most noticeably as frequent interruptions to the cloud connection.
One Problem, Many Causes
Problems with the SignalR redirector manifested as so-called "bursts": sudden, massive increases in server load. We repeatedly deployed new versions in the early morning hours and monitored the servers. At that time of day, everything ran perfectly.
Around 9:30 AM CET, as more and more Citavi users began working on cloud projects, the number of server-client connections then surged suddenly and massively, by factors of several thousand.
This explosion could not have been caused by actual user behavior; it had to be the result of a technical malfunction. Identifying and addressing the cause was a challenge, especially because it turned out to be not a single cause but a complex chain of causes, which we were unable to reproduce in our own functional and load tests.
- The oldest Citavi clients (versions 6.0 and 6.1) contain faulty logic on our end, which leads to frequent reconnects. This problem was addressed in versions 6.2 and later, but there are users who still work with the older versions.
We started by blocking reconnects from these clients at the server level.
- At the same time, we dramatically increased both the number and capacity of our servers, up to as many as 10 high-performance servers.
- Unfortunately, these steps did not result in the reliability we had hoped for. We discovered that the encryption method we had used when distributing the load over 10 servers had increased CPU load too dramatically. Switching to a different encryption method fixed this problem, but the bursts continued to occur.
- After numerous further efforts, we identified a place in the source code of Microsoft's SignalR library that we now believe to be the root cause of the problem. SignalR sends a "keepalive" message in a "foreach" loop to all clients every second. In their standard configuration, the clients expect a keepalive message at least every two seconds; if they do not receive one, they initiate a reconnect.
Depending on server load, it can take more than two seconds to transmit the keepalive message. The clients thus attempt to reconnect, which increases server load, which slows transmission of keepalive messages, which causes clients to attempt to reconnect again, and so on in a loop that eventually results in a fatal overload of the CPU and memory of all the servers.
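This feedback loop can be illustrated with a toy simulation. All parameters below are made up for illustration and are not measured Citavi values: keepalive delivery time grows with the number of connections, and every client whose wait exceeds the two-second timeout reconnects, which adds load.

```python
# Toy simulation of the reconnect feedback loop (illustrative parameters,
# not measured values): keepalive delivery time grows with load; clients
# that wait longer than their 2-second timeout reconnect, adding load,
# which slows keepalives further.

def simulate(clients, per_client_cost=0.004, timeout=2.0, steps=5):
    load_history = [clients]
    for _ in range(steps):
        # Time to push one round of keepalive messages to every client.
        delivery_time = clients * per_client_cost
        if delivery_time > timeout:
            # Every client misses its keepalive and reconnects, roughly
            # doubling the connection churn the server sees.
            clients *= 2
        load_history.append(clients)
    return load_history


# Below the tipping point the load stays flat.
stable = simulate(clients=400)

# Above it, missed keepalives and reconnects compound into a burst.
burst = simulate(clients=600)
```

The qualitative behavior matches what we observed: nothing happens until load crosses a threshold, and then the connection count explodes within minutes rather than growing gradually.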
Microsoft has changed the keepalive protocol in the newer ASP.NET Core library. However, as previously mentioned, we have to use the older library to remain compatible with older Citavi Clients.
We have now identified a way of sending clients a configuration parameter upon initial connection so that they no longer expect a keepalive signal every two seconds. To force all clients (including those on laptops in sleep mode) to break their connection and pick up this new configuration, we shut down the SignalR server twice for about 10 minutes each.
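The idea behind this fix can be sketched as follows. The field and class names are illustrative only; this is not the actual SignalR handshake protocol. The server hands each client a keepalive timeout during the initial handshake, so clients no longer hard-code the two-second expectation.

```python
# Sketch of the fix (illustrative field names, not the real SignalR
# protocol): the server overrides the client's default keepalive timeout
# at connect time, so a slow keepalive no longer triggers a reconnect.

DEFAULT_TIMEOUT = 2.0

def handshake(server_config):
    # The server can now override the client-side default on connection.
    return {"keepalive_timeout": server_config.get("keepalive_timeout",
                                                   DEFAULT_TIMEOUT)}

class Client:
    def __init__(self, handshake_response):
        self.timeout = handshake_response["keepalive_timeout"]

    def should_reconnect(self, seconds_since_last_keepalive):
        return seconds_since_last_keepalive > self.timeout


# A client with the old 2-second default reconnects after a slow keepalive;
# a client configured with a generous timeout tolerates the same delay.
old_client = Client(handshake({}))
new_client = Client(handshake({"keepalive_timeout": 30.0}))
```

Because the parameter only takes effect on a fresh connection, every client had to disconnect once, which is why the brief server shutdowns were necessary.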
These measures and the reboot have been effective. Since December 6, 2019, communication with all versions of Citavi 6 has been completely stable.
What have we learned?
1) A technical problem that has multiple, parallel causes is always difficult to solve. We say that by way of explanation, not excuse.
2) Similarly, it is difficult to draw lessons from the problems stemming from Microsoft's SignalR code. Third-party function libraries are indispensable in software development today. We, like every other software company, are dependent on their reliability. This is why we only use libraries from established and trusted companies.
In the case at hand, the problem was that although Microsoft had developed a new version with better functionality, we were unable to use it for compatibility reasons. This situation will not fundamentally change for Citavi in the future; of course we will always maintain state-of-the-art servers, and of course we will also continue to ensure reliability with older Citavi clients.
On the positive side, Microsoft has made the entire .NET Framework open source. This allowed us to analyze SignalR’s source code and to identify the problematic loop method.
3) We already have extensive testing procedures in place:
- Every night, the most recent code is checked by well over 1000 automatic unit tests.
- Every night, an extensive suite of automatic tests is run, which clicks through and checks the user interface.
- Load tests also run every night, which check the system's behavior under pressure.
- These load tests evidently did not simulate the various older versions of Citavi precisely enough in the case at hand; otherwise we would have identified the problems in the testing phase instead of in active use. We have already undertaken improvements to these tests.
4) Previously, when updating our servers, we have always conducted a "hard" update: data traffic was rerouted from the old version to the new version of the code within a very short time.
We are working on a "soft" update process, in which data traffic will be switched over to the new code gradually, allowing us to test the new code at 10%, 20%, 50%, and finally 100% of real data traffic.
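One common way to implement such a gradual switchover is deterministic hash-based routing. The following sketch is illustrative only, not our actual deployment code: a fixed percentage of traffic is routed to the new code by hashing a stable client identifier, so the same client consistently sees the same version while the rollout percentage is raised step by step.

```python
# Sketch of a gradual ("soft") rollout via hash-based routing
# (illustrative, not our deployment code): a stable client identifier is
# hashed into a bucket from 0 to 99, and buckets below the rollout
# percentage are served by the new code.

import hashlib

def route(client_id, new_version_percent):
    # Stable hash in [0, 100); the same client always lands in the same
    # bucket, so its experience doesn't flip back and forth mid-rollout.
    bucket = int(hashlib.sha256(client_id.encode()).hexdigest(), 16) % 100
    return "new" if bucket < new_version_percent else "old"


clients = [f"client-{i}" for i in range(1000)]

# At 10%, roughly a tenth of clients see the new code; at 100%, all do.
share_at_10 = sum(route(c, 10) == "new" for c in clients) / len(clients)
all_new = all(route(c, 100) == "new" for c in clients)
```

The advantage over random routing is that problems surface in a small, consistent slice of traffic that can be rolled back instantly by lowering the percentage again.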
5) Azure, Microsoft's cloud platform, is highly developed and competitive, as its steadily growing market share shows. SignalR technology, however, is an outlier in that respect.
For Citavi 6.4 and beyond, we will switch to a different real-time communication provider, one whose services are used successfully by major corporations and ISPs.
At no point did the actual Citavi servers or databases fail. There was a communication error between the servers and clients. However, we are very cognizant of the difficulties this issue caused for our valued users, to whom we extend our heartfelt apologies.