Clone Wars

In this post I want to discuss a common coding error that can cause a module to run out of, or at least seriously deplete, the stcp device clones available for creating TCP sockets.

Every call to the socket function results in STCP creating a new clone of the stcp device. The maximum number of clones that can be created is controlled by the clone_limit field in the devices.tin entry.

/    =name               stcp.m17
     =module_name        m17
     =device_type        streams
     =access_list_name   stcp_access
     =streams_driver     stcp
     =clone_limit        5120
     =comment            Provides TCP API

You can see how many clones are currently in use by dumping the device structure in analyze_system and looking at the clone_count value. If clone_count equals clone_limit then calls to the socket function will return an e$clone_limit_exceeded: “The clone limit for the device has been exceeded” error.

as:  match clone; dump_dvt -name stcp.m17
clone_limit:       5120
clone_count:       42
cloned_from:       -27271
remote_clone_limit: 0
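
If you capture that dump_dvt output as text, a few lines of C can turn it into a usage figure you can watch over time. The following is only a minimal sketch: the function names findField and cloneUsagePercent are my own inventions for illustration, and it assumes the output has already been captured into a string in the "name: value" layout shown above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper. Looks for a line of the form "name: value" in
   dump_dvt-style text and stores the number in *value. Returns 1 if
   found, 0 otherwise. Matching only at the start of a line keeps
   "remote_clone_limit:" from being mistaken for "clone_limit:". */
int findField (const char *text, const char *name, long *value)
{
   size_t      nameLen;
   const char *line;

   assert (text != NULL && name != NULL && value != NULL);
   nameLen = strlen (name);
   line = text;

   while (line != NULL && *line != '\0')
      {
      const char *p = line;

      while (*p == ' ' || *p == '\t')          /* skip leading blanks */
         p++;
      if (strncmp (p, name, nameLen) == 0 && p[nameLen] == ':')
         {
         const char *q = p + nameLen + 1;

         while (*q == ' ' || *q == '\t')       /* skip blanks before value */
            q++;
         if ((*q >= '0' && *q <= '9') || *q == '-')
            {
            *value = strtol (q, NULL, 10);
            return 1;
            }
         }
      line = strchr (line, '\n');              /* advance to the next line */
      if (line != NULL)
         line++;
      }
   return 0;
}

/* Returns the percentage of the clone limit in use, or -1.0 if either
   field is missing or the limit is not positive. */
double cloneUsagePercent (const char *dumpText)
{
   long limit;
   long count;

   if (!findField (dumpText, "clone_limit", &limit) ||
       !findField (dumpText, "clone_count", &count) || limit <= 0)
      return -1.0;
   return 100.0 * (double) count / (double) limit;
}
```

A monitoring script could call cloneUsagePercent on each captured dump and raise an alarm well before the figure reaches 100.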

In general the socket call that creates the clone device is followed by either a connect call or by bind and listen calls. At that point you can see a corresponding entry when you execute the netstat command.

netstat -numeric -all_sockets
Active connections (including servers)
Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
tcp        0      0  172.16.124.217:23  192.168.109.22:50038 ESTABLISHED
tcp        0      0  172.16.124.217:22  172.16.124.24:54987 ESTABLISHED
tcp        0      0  10.20.1.1:37       10.20.1.26:1528    TIME_WAIT
tcp        0      0  10.20.1.1:37       10.20.1.27:1579    TIME_WAIT
tcp        0      0  172.16.124.217:61780 192.168.109.22:23 ESTABLISHED
tcp        0      0  172.16.124.217:22  172.16.124.50:17421 ESTABLISHED
tcp        0      0  172.16.124.217:22  172.16.124.50:17658 ESTABLISHED
tcp        0      0  *:23               *:*                LISTEN
tcp        0      0  *:6666             *:*                LISTEN
tcp        0      0  *:21               *:*                LISTEN
tcp        0      0  *:3000             *:*                LISTEN
tcp        0      0  *:7                *:*                LISTEN
tcp        0      0  *:9                *:*                LISTEN
tcp        0      0  *:13               *:*                LISTEN
tcp        0      0  *:19               *:*                LISTEN
tcp        0      0  *:37               *:*                LISTEN
tcp        0      0  *:901              *:*                LISTEN
tcp        0      0  *:1414             *:*                LISTEN
tcp        0      0  *:81               *:*                LISTEN
tcp        0      0  10.20.1.1:37       10.20.1.9:3633     TIME_WAIT
tcp        0     50  10.10.1.1:52653    10.10.1.200:3001   ESTABLISHED
tcp        0      0  10.10.1.1:52624    10.10.1.200:3001   FIN_WAIT_1
tcp        0      0  10.20.1.1:61704    10.20.1.3:48879    ESTABLISHED
tcp        0      0  *:3001             *:*                LISTEN
tcp        0      0  *:3002             *:*                LISTEN
tcp        0      0  *:3003             *:*                LISTEN
tcp        0      0  *:4000             *:*                LISTEN
tcp        0      0  172.16.124.217:4000 172.16.124.78:1024 ESTABLISHED
tcp        0      0  172.16.124.217:4000 172.16.124.227:1025 ESTABLISHED
tcp        0      0  *:4001             *:*                LISTEN
tcp        0      0  *:4002             *:*                LISTEN
tcp        0      0  *:4003             *:*                LISTEN
tcp        0      0  *:4004             *:*                LISTEN
tcp        0      0  *:22               *:*                LISTEN
tcp        0      0  *:4005             *:*                LISTEN
tcp        0      0  *:4006             *:*                LISTEN
tcp        0      0  172.16.124.217:4006 172.16.124.203:49231 ESTABLISHED
tcp        0      0  *:4007             *:*                LISTEN
tcp        0      0  *:4008             *:*                LISTEN
tcp        0      0  *:4009             *:*                LISTEN
tcp        0      0  172.16.124.217:4008 172.16.124.203:49262 ESTABLISHED
tcp        0      0  *:4010             *:*                LISTEN
tcp        0      0  *:4011             *:*                LISTEN
tcp        0      0  *:4012             *:*                LISTEN
tcp        0      0  *:4013             *:*                LISTEN
tcp        0      0  *:4014             *:*                LISTEN
tcp        0      0  *:4015             *:*                LISTEN
tcp        0      0  *:80               *:*                LISTEN
tcp        0      0  *:9182             *:*                LISTEN
tcp        0      0  *:445              *:*                LISTEN
tcp        0      0  *:139              *:*                LISTEN
tcp        0      0  10.20.1.1:53495    10.20.1.9:48879    ESTABLISHED
tcp        0      0  10.20.1.1:61703    10.20.1.3:48879    ESTABLISHED
tcp        0      0  10.20.1.1:61707    10.20.1.3:48879    ESTABLISHED
tcp        0      0  10.20.1.1:61705    10.20.1.9:48879    ESTABLISHED
tcp        0      0  10.20.1.1:61709    10.20.1.9:48879    ESTABLISHED
tcp        0      0  10.20.1.1:61710    10.20.1.9:48879    ESTABLISHED
tcp        0      0  172.16.124.217:61789 172.16.124.203:4000 ESTABLISHED
tcp        0    400  172.16.124.217:22  172.16.124.50:17674 ESTABLISHED

If you count the lines you will see more than 42. That is because not every entry shown by netstat uses an stcp device clone; for example, OSL connections and the X25_cpc connections used with the NIO do not. Take a look at socket_count.cm for more details.

If a socket call is made without a subsequent connect or bind, or the bind fails, you can create the opposite problem: the value of clone_count is larger than the number of entries shown by netstat.

as:  match clone; dump_dvt -name stcp.m17
clone_limit:       5120
clone_count:       4131
cloned_from:       -23179
remote_clone_limit: 0
as:

I am not including the netstat output again, but trust me, it hasn’t changed from the previous example.

This situation, 4089 extra (4131 – 42) and apparently unaccounted-for STCP device clones, was created by the following code fragment. The code calls the socket function followed by the bind function. If the bind fails, it loops. Because each pass through the loop calls socket again without closing the socket from the previous pass, every failed bind leaves one more clone allocated. Many applications would add a timer to wait 1, 60, or 300 seconds and try again, but that just delays the inevitable, assuming of course that the condition causing the error does not go away.

tryAgain = 1;
while (tryAgain)
  {
  if ((socks0 = socket (AF_INET, SOCK_STREAM, 0)) < 0)
     {
     if (debugFlag)
        perror ("badService: can't create listening socket");
     }
  else {
/* build a sockaddr structure holding the address we will bind to.
   The IP address is INADDR_ANY meaning we will listen on all active
   IP addresses */

     bzero ( (char *) &serv_addr, sizeof (serv_addr));
     serv_addr.sin_family        = AF_INET;
     serv_addr.sin_addr.s_addr   = htonl (INADDR_ANY);
     serv_addr.sin_port          = htons (portNumber);

/* now bind to the address and port */
     if (bind (socks0, (struct sockaddr *) &serv_addr,
                                      sizeof (serv_addr)) < 0)
        {
        if (debugFlag)
           perror ("badService: can't bind address, trying again");
        }
     else
        tryAgain = 0;
     }
   }

The most common cause of the bind error is that another process has already bound to the requested port. Regardless of the reason for the error, the solution is to close the socket after reporting the bind error, which releases the clone before the loop creates a new one.

tryAgain = 1;
while (tryAgain)
  {
  if ((socks0 = socket (AF_INET, SOCK_STREAM, 0)) < 0)
     {
     if (debugFlag)
        perror ("goodService: can't create listening socket");
     }
  else {
/* build a sockaddr structure holding the address we will bind to.
   The IP address is INADDR_ANY meaning we will listen on all active
   IP addresses */

     bzero ( (char *) &serv_addr, sizeof (serv_addr));
     serv_addr.sin_family        = AF_INET;
     serv_addr.sin_addr.s_addr   = htonl (INADDR_ANY);
     serv_addr.sin_port          = htons (portNumber);

/* now bind to the address and port */
     if (bind (socks0, (struct sockaddr *) &serv_addr,
                                      sizeof (serv_addr)) < 0)
        {
        if (debugFlag)
           perror ("goodService: can't bind address, trying again");
        if (close (socks0) < 0)
           if (debugFlag)
              perror ("goodService: can't close old socket");
        }
     else
        tryAgain = 0;
     }
   }
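
The fragment above still retries forever and without any pause. As a rough sketch of how the same fix might look with a bounded number of attempts and a delay between them, consider the function below. The name bindWithRetry and its maxAttempts and delaySeconds parameters are hypothetical, not part of any STCP API, and I have used memset in place of bzero; treat it as an illustration rather than a drop-in replacement.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper. Tries to create a TCP socket bound to portNumber
   on all local IP addresses. Returns the bound socket, or -1 once
   maxAttempts attempts have failed. No socket is left open on failure. */
int bindWithRetry (unsigned short portNumber, int maxAttempts,
                   unsigned int delaySeconds)
{
   struct sockaddr_in serv_addr;
   int                attempt;
   int                socks0;

   assert (maxAttempts > 0);

   for (attempt = 1; attempt <= maxAttempts; attempt++)
      {
      socks0 = socket (AF_INET, SOCK_STREAM, 0);
      if (socks0 < 0)
         perror ("bindWithRetry: can't create listening socket");
      else
         {
         /* INADDR_ANY means we will listen on all active IP addresses */
         memset (&serv_addr, 0, sizeof (serv_addr));
         serv_addr.sin_family      = AF_INET;
         serv_addr.sin_addr.s_addr = htonl (INADDR_ANY);
         serv_addr.sin_port        = htons (portNumber);

         if (bind (socks0, (struct sockaddr *) &serv_addr,
                   sizeof (serv_addr)) == 0)
            return socks0;       /* success: the caller now owns the socket */

         perror ("bindWithRetry: can't bind address");

         /* release the clone before the next attempt creates a new one */
         if (close (socks0) < 0)
            perror ("bindWithRetry: can't close old socket");
         }

      /* wait before retrying, but not after the final attempt */
      if (attempt < maxAttempts && delaySeconds > 0)
         sleep (delaySeconds);
      }
   return -1;
}
```

A caller might use bindWithRetry (portNumber, 10, 60) and treat a return value of -1 as a reason to log the failure and exit. Giving up after a fixed number of attempts means that even a bug elsewhere cannot turn the loop into a slow leak.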

This post has been about reaching the clone_limit because of a coding error, but what if there is no error? What if the application environment really is using all those clone devices? In that case, assuming you have not reached the system limit of 16,000, you can raise the limit. You need to update the clone_limit field of the stcp device in the devices.tin file and recreate the devices.table. If you are on a 17.1 or later release you can use the update_device_info command to raise the limit for the current boot and rely on the updated devices.table to take care of the next boot. On releases before 17.1 your only real option is to reboot.

You should set the limit to a value that corresponds to your current needs plus expected growth; you should not just raise the limit to 16,000. Even if you do not have an application bug that is consuming clone devices now, there is no guarantee that you will not have one in the future. An application that consumes all the available clone devices will also consume a great deal of streams memory, and exhausting streams memory will negatively affect existing TCP connections.
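
The sizing advice, current needs plus expected growth but never more than the system limit, comes down to simple arithmetic. The sketch below is one way to express it. The function name suggestCloneLimit and the idea of stating growth as a percentage are my own assumptions; only the 16,000 ceiling comes from the discussion above.

```c
#include <assert.h>

#define STCP_SYSTEM_CLONE_LIMIT 16000L   /* system limit quoted above */

/* Hypothetical helper. Returns the observed peak number of clones in use
   plus growthPercent percent of head room (rounded up), capped at the
   system limit. */
long suggestCloneLimit (long peakClonesInUse, long growthPercent)
{
   long suggested;

   assert (peakClonesInUse >= 0 && growthPercent >= 0);

   suggested = peakClonesInUse +
               (peakClonesInUse * growthPercent + 99) / 100;
   if (suggested > STCP_SYSTEM_CLONE_LIMIT)
      suggested = STCP_SYSTEM_CLONE_LIMIT;
   return suggested;
}
```

For example, a module that peaks at 4000 clones and expects 25 percent growth would get a suggested clone_limit of 5000. The peak should come from a period when you know no application is leaking clones.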
