In this post I discuss a common coding error that can cause a module to run out of, or at least seriously deplete, the stcp device clones available for creating TCP sockets.
Every call to the socket function results in STCP creating a new clone of the stcp device. The maximum number of clones that can be created is controlled by the clone_limit field in the devices.tin entry.
/ =name stcp.m17
=module_name m17
=device_type streams
=access_list_name stcp_access
=streams_driver stcp
=clone_limit 5120
=comment Provides TCP API
|
You can see how many clones are currently in use by dumping the device structure in analyze_system and looking at the clone_count value. If clone_count equals clone_limit, calls to the socket function will return the error e$clone_limit_exceeded: “The clone limit for the device has been exceeded.”
as: match clone; dump_dvt -name stcp.m17
clone_limit: 5120
clone_count: 42
cloned_from: -27271
remote_clone_limit: 0
|
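If you capture the dump_dvt output shown above as text, a small helper can turn the two values into a percentage that is easy to watch over time. The following is only a sketch under that assumption. The function names read_field and clone_usage_percent are mine for illustration and are not part of VOS or analyze_system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: locate "name:" in captured dump_dvt text and parse
   the integer that follows it. The match must start at the beginning of
   the text or after white space, so "clone_limit" does not match inside
   "remote_clone_limit". Returns 1 on success, 0 otherwise. */
int read_field(const char *text, const char *name, long *value)
{
    const char *p;
    size_t n;

    assert(text != NULL && name != NULL && value != NULL);
    n = strlen(name);
    for (p = strstr(text, name); p != NULL; p = strstr(p + 1, name)) {
        if (p[n] == ':' &&
            (p == text || p[-1] == ' ' || p[-1] == '\t' || p[-1] == '\n'))
            return sscanf(p + n + 1, "%ld", value) == 1;
    }
    return 0;
}

/* Percentage of the clone limit currently in use (rounded down),
   or -1 if either field is missing or the limit is not positive. */
int clone_usage_percent(const char *dump_text)
{
    long limit, count;

    if (!read_field(dump_text, "clone_limit", &limit) ||
        !read_field(dump_text, "clone_count", &count) || limit <= 0)
        return -1;
    return (int) ((count * 100) / limit);
}
```

Passing the captured text to clone_usage_percent gives a single number you could log or compare against a threshold of your own choosing.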
In general, the socket call that creates the clone device is followed by either a connect call or by bind and listen calls. At that point you can see a corresponding entry when you execute the netstat command.
netstat -numeric -all_sockets
Active connections (including servers)
Proto Recv-Q Send-Q Local Address         Foreign Address       (state)
tcp   0      0      172.16.124.217:23     192.168.109.22:50038  ESTABLISHED
tcp   0      0      172.16.124.217:22     172.16.124.24:54987   ESTABLISHED
tcp   0      0      10.20.1.1:37          10.20.1.26:1528       TIME_WAIT
tcp   0      0      10.20.1.1:37          10.20.1.27:1579       TIME_WAIT
tcp   0      0      172.16.124.217:61780  192.168.109.22:23     ESTABLISHED
tcp   0      0      172.16.124.217:22     172.16.124.50:17421   ESTABLISHED
tcp   0      0      172.16.124.217:22     172.16.124.50:17658   ESTABLISHED
tcp   0      0      *:23                  *:*                   LISTEN
tcp   0      0      *:6666                *:*                   LISTEN
tcp   0      0      *:21                  *:*                   LISTEN
tcp   0      0      *:3000                *:*                   LISTEN
tcp   0      0      *:7                   *:*                   LISTEN
tcp   0      0      *:9                   *:*                   LISTEN
tcp   0      0      *:13                  *:*                   LISTEN
tcp   0      0      *:19                  *:*                   LISTEN
tcp   0      0      *:37                  *:*                   LISTEN
tcp   0      0      *:901                 *:*                   LISTEN
tcp   0      0      *:1414                *:*                   LISTEN
tcp   0      0      *:81                  *:*                   LISTEN
tcp   0      0      10.20.1.1:37          10.20.1.9:3633        TIME_WAIT
tcp   0      50     10.10.1.1:52653       10.10.1.200:3001      ESTABLISHED
tcp   0      0      10.10.1.1:52624       10.10.1.200:3001      FIN_WAIT_1
tcp   0      0      10.20.1.1:61704       10.20.1.3:48879       ESTABLISHED
tcp   0      0      *:3001                *:*                   LISTEN
tcp   0      0      *:3002                *:*                   LISTEN
tcp   0      0      *:3003                *:*                   LISTEN
tcp   0      0      *:4000                *:*                   LISTEN
tcp   0      0      172.16.124.217:4000   172.16.124.78:1024    ESTABLISHED
tcp   0      0      172.16.124.217:4000   172.16.124.227:1025   ESTABLISHED
tcp   0      0      *:4001                *:*                   LISTEN
tcp   0      0      *:4002                *:*                   LISTEN
tcp   0      0      *:4003                *:*                   LISTEN
tcp   0      0      *:4004                *:*                   LISTEN
tcp   0      0      *:22                  *:*                   LISTEN
tcp   0      0      *:4005                *:*                   LISTEN
tcp   0      0      *:4006                *:*                   LISTEN
tcp   0      0      172.16.124.217:4006   172.16.124.203:49231  ESTABLISHED
tcp   0      0      *:4007                *:*                   LISTEN
tcp   0      0      *:4008                *:*                   LISTEN
tcp   0      0      *:4009                *:*                   LISTEN
tcp   0      0      172.16.124.217:4008   172.16.124.203:49262  ESTABLISHED
tcp   0      0      *:4010                *:*                   LISTEN
tcp   0      0      *:4011                *:*                   LISTEN
tcp   0      0      *:4012                *:*                   LISTEN
tcp   0      0      *:4013                *:*                   LISTEN
tcp   0      0      *:4014                *:*                   LISTEN
tcp   0      0      *:4015                *:*                   LISTEN
tcp   0      0      *:80                  *:*                   LISTEN
tcp   0      0      *:9182                *:*                   LISTEN
tcp   0      0      *:445                 *:*                   LISTEN
tcp   0      0      *:139                 *:*                   LISTEN
tcp   0      0      10.20.1.1:53495       10.20.1.9:48879       ESTABLISHED
tcp   0      0      10.20.1.1:61703       10.20.1.3:48879       ESTABLISHED
tcp   0      0      10.20.1.1:61707       10.20.1.3:48879       ESTABLISHED
tcp   0      0      10.20.1.1:61705       10.20.1.9:48879       ESTABLISHED
tcp   0      0      10.20.1.1:61709       10.20.1.9:48879       ESTABLISHED
tcp   0      0      10.20.1.1:61710       10.20.1.9:48879       ESTABLISHED
tcp   0      0      172.16.124.217:61789  172.16.124.203:4000   ESTABLISHED
tcp   0      400    172.16.124.217:22     172.16.124.50:17674   ESTABLISHED
|
If you count the lines you will find more than 42. That is because not every entry shown by netstat uses an stcp device clone; OSL connections and the X25_cpc connections used with the NIO, for example, do not. Take a look at socket_count.cm for more details.
If a socket call is made without a subsequent connect or bind, or if the bind fails, you can create the opposite problem: the value of clone_count is larger than the number of entries shown by netstat.
as: match clone; dump_dvt -name stcp.m17
clone_limit: 5120
clone_count: 4131
cloned_from: -23179
remote_clone_limit: 0
as:
|
I am not including the netstat output again, but trust me, it hasn’t changed from the previous example.
This situation, an extra 4089 (4131 – 42) apparently unaccounted-for STCP device clones, was created by the following code fragment. The code calls the socket function followed by the bind function. If the bind fails, it loops and calls socket again without closing the socket it already has. Many applications would add a timer to wait 1, 60, or 300 seconds before trying again, but that just delays the inevitable, assuming of course that the condition causing the error does not go away.
tryAgain = 1;
while (tryAgain)
{
    if ((socks0 = socket (AF_INET, SOCK_STREAM, 0)) < 0)
    {
        if (debugFlag)
            perror ("badService: can't create listening socket");
    }
    else {
        /* build a sockaddr structure holding the address we will bind to.
           The IP address is INADDR_ANY meaning we will listen on all active
           IP addresses */
        bzero ( (char *) &serv_addr, sizeof (serv_addr));
        serv_addr.sin_family = AF_INET;
        serv_addr.sin_addr.s_addr = htonl (INADDR_ANY);
        serv_addr.sin_port = htons (portNumber);
        /* now bind to the address and port */
        if (bind (socks0, (struct sockaddr *) &serv_addr,
                  sizeof (serv_addr)) < 0)
        {
            if (debugFlag)
                perror ("badService: can't bind address, trying again");
        }
        else
            tryAgain = 0;
    }
}
|
The most common error is that another process has already bound to the requested port. Regardless of the reason for the error, the solution is to close the socket after reporting the bind error.
tryAgain = 1;
while (tryAgain)
{
    if ((socks0 = socket (AF_INET, SOCK_STREAM, 0)) < 0)
    {
        if (debugFlag)
            perror ("goodService: can't create listening socket");
    }
    else {
        /* build a sockaddr structure holding the address we will bind to.
           The IP address is INADDR_ANY meaning we will listen on all active
           IP addresses */
        bzero ( (char *) &serv_addr, sizeof (serv_addr));
        serv_addr.sin_family = AF_INET;
        serv_addr.sin_addr.s_addr = htonl (INADDR_ANY);
        serv_addr.sin_port = htons (portNumber);
        /* now bind to the address and port */
        if (bind (socks0, (struct sockaddr *) &serv_addr,
                  sizeof (serv_addr)) < 0)
        {
            if (debugFlag)
                perror ("goodService: can't bind address, trying again");
            /* release the clone device before the next attempt */
            if (close (socks0) < 0)
                if (debugFlag)
                    perror ("goodService: can't close old socket");
        }
        else
            tryAgain = 0;
    }
}
|
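The same idea can be packaged as a helper that also bounds the number of attempts and waits between them, so a persistent bind failure neither spins the CPU nor retries forever. This is a minimal sketch, not code from the original application. The name bind_with_retry and its parameters are my own, and I am assuming POSIX-style headers, so you may need to adjust the includes for your release.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: create a TCP socket and bind it to the given port
   on all local addresses. After every failed bind the socket is closed so
   that no clone device is left behind. Gives up after max_attempts tries.
   Returns the bound socket, or -1 if every attempt failed. */
int bind_with_retry(unsigned short port, int max_attempts,
                    unsigned int delay_seconds)
{
    struct sockaddr_in addr;
    int attempt, s;

    assert(max_attempts > 0);
    for (attempt = 1; attempt <= max_attempts; attempt++) {
        s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) {
            perror("bind_with_retry: can't create socket");
        } else {
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(port);
            if (bind(s, (struct sockaddr *) &addr, sizeof(addr)) == 0)
                return s;   /* success: the caller now owns the socket */
            perror("bind_with_retry: can't bind address");
            if (close(s) < 0)   /* release the clone before retrying */
                perror("bind_with_retry: can't close socket");
        }
        /* wait before the next attempt, but not after the last one */
        if (attempt < max_attempts && delay_seconds > 0)
            sleep(delay_seconds);
    }
    return -1;
}
```

A caller might use bind_with_retry(portNumber, 5, 60) and treat a return value of -1 as a reason to log the problem and exit, rather than looping indefinitely.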
This post has been about reaching the clone_limit because of a coding error, but what if there is no error? What if the application environment really is using all those clone devices? In that case, assuming you have not reached the system limit of 16,000, you can raise the limit. You need to update the clone_limit field of the stcp device in the devices.tin file and recreate the devices.table. If you are on release 17.1 or later, you can use the update_device_info command to raise the limit for the current boot and rely on the updated devices.table to take care of the next boot. On releases before 17.1 your only real option is to reboot. You should set the limit to a value that corresponds to your current needs plus expected growth; you should not just raise the limit to 16,000. Even if you do not have an application bug that is consuming clone devices now, there is no guarantee that you will not have one in the future. An application that consumes all the available clone devices will also consume a great deal of streams memory, and exhausting streams memory will negatively affect existing TCP connections.
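To make "current needs plus expected growth" concrete, here is one way to do the arithmetic. It is only an illustration. The function name suggest_clone_limit and the idea of expressing growth as a percentage are my assumptions; the only figure taken from this post is the 16,000 system limit.

```c
#include <assert.h>

#define SYSTEM_CLONE_LIMIT 16000L   /* system maximum cited in this post */

/* Hypothetical sizing rule: the observed peak clone_count plus a growth
   allowance, capped at the system maximum. growth_percent is whatever
   allowance you judge appropriate for your site. */
long suggest_clone_limit(long peak_clone_count, long growth_percent)
{
    long suggested;

    assert(peak_clone_count >= 0 && growth_percent >= 0);
    suggested = peak_clone_count + (peak_clone_count * growth_percent) / 100;
    return suggested > SYSTEM_CLONE_LIMIT ? SYSTEM_CLONE_LIMIT : suggested;
}
```

With a legitimate peak of 4000 clones and a 50 percent allowance this suggests a limit of 6000, which still leaves a runaway application well short of 16,000.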