“The application has been running fine for years. Last week the network was upgraded and we moved from 100 Mbps to gigabit. Now the last half of the data in some messages is garbage. The network people swear it is not the network – but that is the only thing that changed.”
The good news is that the expensive network upgrade did not break the application.
The bad news is that the application is broken and has always been broken. What *many* people writing TCP applications do not realize is that TCP is a stream of bytes, not messages. The fact that an application sends two 1000 byte messages does not mean that the sending TCP stack will send two 1000 byte TCP segments. The application messages can be segmented into smaller pieces based on either the receiver’s advertised maximum segment size or some configuration limitation on the sender. Retransmissions can also combine and fragment application messages.
Even if the sending TCP stack actually sends two 1000 byte TCP segments, there is no guarantee that the receiving TCP stack will give the receiving application two 1000 byte messages. For example, if the application’s buffer is only 500 bytes, that is all the TCP stack will return. On the other hand, if the buffer is 1500 bytes and both 1000 byte segments have arrived, the TCP stack will return 1500 bytes in the first call and 500 in the second. It is up to the application to take the byte stream and parse it correctly back into messages.
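The effect is easy to observe on a local pair of connected stream sockets. The following is only a minimal sketch in Python, not the application’s code: `demo_stream_reads` is a hypothetical helper, and `socket.socketpair()` stands in for a real TCP connection. It sends two 1000 byte messages and records how many bytes each receive call returns for a given buffer size.

```python
import socket


def demo_stream_reads(buffer_size):
    """Send two 1000 byte messages, then return the size of each read."""
    # A connected pair of local stream sockets stands in for a TCP connection.
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(b"A" * 1000)  # first application "message"
        sender.sendall(b"B" * 1000)  # second application "message"
        sender.close()               # end of stream, so the loop below terminates

        sizes = []
        while True:
            chunk = receiver.recv(buffer_size)  # returns AT MOST buffer_size bytes
            if not chunk:                       # empty result means the peer closed
                break
            sizes.append(len(chunk))
        return sizes
    finally:
        sender.close()    # closing twice is harmless
        receiver.close()


if __name__ == "__main__":
    print(demo_stream_reads(500))   # read sizes with a 500 byte buffer
    print(demo_stream_reads(1500))  # read sizes with a 1500 byte buffer
```

The exact sizes depend on timing and on the stack, which is the point: the only things the receiver can rely on are that no read exceeds the buffer size and that the bytes arrive in order. Nothing in the read sizes marks where one message ends and the next begins.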
What does this have to do with upgrading the network? Well, the OpenVOS server and Linux client were on different subnets, so the OpenVOS system was advertising a maximum segment size of 536 bytes. The application messages sent by the client were 1000 bytes, so the messages were being segmented into 2 pieces. Prior to the upgrade it appears that both segments were arriving at the OpenVOS server before the application posted its receive, so all 1000 bytes were read with one call to the receive function. After the upgrade the segment timing changed so that some of the time only the first segment was available when the application posted its receive. The last 464 (1000 – 536) bytes of the application buffer were not filled in by the TCP stack and contained whatever garbage was there before the receive was posted.
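The failure can be reproduced in miniature. This is a sketch under stated assumptions rather than the actual application: `naive_receive` is a hypothetical function, a local socket pair replaces the network, and the 0xAA fill byte simply plays the role of whatever was in the buffer beforehand. Only the first 536 bytes have "arrived" when the single receive is posted.

```python
import socket

MESSAGE_SIZE = 1000   # application message length from the text
FIRST_SEGMENT = 536   # advertised MSS from the text


def naive_receive():
    """Post one receive while only the first segment is available."""
    sender, receiver = socket.socketpair()
    try:
        message = b"M" * MESSAGE_SIZE
        # Simulate the timing after the upgrade: only the first segment is here.
        sender.sendall(message[:FIRST_SEGMENT])

        # The application buffer still holds stale data from earlier use.
        buf = bytearray(b"\xAA" * MESSAGE_SIZE)

        # The bug: one receive call, then the whole buffer is treated as a message.
        received = receiver.recv_into(buf)
        return received, bytes(buf)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    received, buf = naive_receive()
    stale = MESSAGE_SIZE - received
    print(f"received {received} bytes, {stale} bytes of the buffer are stale")
```

The return value of the receive call is the only reliable indication of how much of the buffer is valid. Code that ignores it and processes all 1000 bytes will read the untouched tail as if it were message data.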
In this case there was an easy quick fix: increase OpenVOS’s advertised MSS value to 1460 (see An easy way to improve TCP throughput across subnets). That, however, is really only a stopgap, because nothing guarantees that a message will always arrive in a single segment. The real solution will be to rewrite the code to correctly parse the TCP byte stream back into application messages instead of just assuming that one receive call will return one message.
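For fixed-length messages like the 1000 byte ones described here, that parsing can be as simple as looping until the whole message has been read. The following is a minimal sketch of that idea in Python, not the rewrite that was eventually done: `recv_exact` and `MESSAGE_SIZE` are illustrative names, and the fixed message length is assumed from the description above.

```python
import socket

MESSAGE_SIZE = 1000  # fixed application message length, as in the text


def recv_exact(sock, count):
    """Read exactly `count` bytes from a stream socket, looping over short reads."""
    buf = bytearray(count)
    view = memoryview(buf)
    filled = 0
    while filled < count:
        # recv_into may return fewer bytes than requested; keep filling the rest.
        n = sock.recv_into(view[filled:])
        if n == 0:
            # The peer closed the connection before a full message arrived.
            raise ConnectionError(
                f"connection closed after {filled} of {count} bytes"
            )
        filled += n
    return bytes(buf)


if __name__ == "__main__":
    sender, receiver = socket.socketpair()
    message = bytes(range(250)) * 4            # 1000 distinct-looking bytes
    sender.sendall(message[:536])              # first "segment"
    sender.sendall(message[536:])              # second "segment"
    assert recv_exact(receiver, MESSAGE_SIZE) == message
    sender.close()
    receiver.close()
```

With this in place it no longer matters whether the message arrives in one piece, two, or twenty. Variable-length messages need a little more, typically a length prefix or a delimiter, but the same rule applies: keep reading until a complete message is in hand.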