How to investigate "too many open files" errors

Last week at work, we were trying to find the root cause of a problem that was killing one of our applications. It is a Rails app running on Debian, and we had some clues about the problem:
- looking at New Relic errors, we saw many errors like "getaddrinfo: Name or service not known"
- looking at Unicorn logs, there were a lot of "too many open files" errors

It seemed the application was being killed by these errors.

At first we thought the server might have network problems, since that would explain the "Name or service not known" error, which happens when a domain can't be resolved through DNS.

But after some research, we remembered that this kind of "too many open files" error is about the number of file descriptors a process has open, which includes sockets as well as regular files. It is possible to list all open files with the lsof command. We ran it, but the number was too low: the default limit of 1024 descriptors per process did not seem to be anywhere near reached.
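Before counting open files, it helps to confirm which limit actually applies to the running process, since it may differ from the limit in your own shell. The sketch below reads it from /proc/<pid>/limits on Linux. The function name open_files_limit is made up for illustration, and $$ (the current shell) stands in for the application's pid.

```shell
# Print the soft "open files" limit that applies to a running process.
# open_files_limit is a hypothetical helper name, not a standard command.
open_files_limit() {
    # The line looks like: "Max open files  1024  4096  files"
    # Field 4 is the soft limit, field 5 the hard limit.
    awk '/Max open files/ {print $4}' "/proc/$1/limits"
}

# Soft limit for new processes started from this shell.
ulimit -n

# Soft limit for an already-running process ($$ is this shell's pid;
# replace it with the pid of the application you are investigating).
open_files_limit "$$"
```

If the two numbers differ, the second one is the one that matters for the process being investigated.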

So we kept searching for an answer, and we found another way to list the files a process currently has open, using a command like this:

ls -la /proc/3591/fd

This command shows all the file descriptors of a process, where 3591 is the process ID (pid).

We ran this command against our process, and we noticed that many file descriptors were not being listed due to permission constraints.

When we ran the same command as root, all the FDs were listed. So we thought, "Let's try running lsof as root to see if the result is different", and it was: a big number appeared!
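The /proc listing can also be turned into a quick "used versus limit" check. This is a minimal sketch for Linux. The function name fd_usage is hypothetical, and $$ is only a stand-in for the real pid. It refuses to report a number when the fd directory is not readable, which is the situation that calls for root.

```shell
# Report how many descriptors a process has open against its soft limit.
# fd_usage is a hypothetical helper name used for illustration only.
fd_usage() {
    pid="$1"
    # Without permission the listing is incomplete or empty, so stop early.
    if [ ! -r "/proc/$pid/fd" ]; then
        echo "cannot read /proc/$pid/fd; try again as root" >&2
        return 1
    fi
    # Each entry in /proc/<pid>/fd is one open descriptor.
    used=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
    limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$used of $limit"
}

# Example against the current shell; use the application's pid in practice.
fd_usage "$$"
```

A count close to the limit points to the process that is leaking descriptors.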

The output of lsof can be filtered with grep, so we could analyse why there were so many open files.

After some analysis, and with a little bash scripting, we ended up with this command:

lsof | grep -e "^ruby" | awk '{print $9}' | grep imap | wc

This showed us a lot of open IMAP connections.
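When it is not yet known what to grep for, grouping the connections by remote endpoint can show which destination dominates. The sketch below assumes lsof's default column layout, where the connection name is the ninth field, as in the command above. The function name remote_endpoints is made up for illustration.

```shell
# Read lsof output on stdin and count connections per remote endpoint.
# remote_endpoints is a hypothetical helper name, not part of lsof.
remote_endpoints() {
    # Network rows have a NAME like "local:port->remote:port";
    # keep only the part after "->" so ephemeral local ports do not matter.
    awk '$9 ~ /->/ { sub(/.*->/, "", $9); print $9 }' |
        sort | uniq -c | sort -rn
}

# Usage, as root, with 3591 being the pid of the process:
#   lsof -p 3591 | remote_endpoints | head
```

The most frequent endpoint appears first, so a leak like ours would show up as one line with a very large count.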

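The app uses IMAP to get some information about the mailboxes, and the connections were not being closed afterwards, so the problem was found! It most likely explains the getaddrinfo errors too: a DNS lookup needs a socket, and a process that has run out of file descriptors cannot open one.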

So the lesson is: when investigating errors like "too many open files", run lsof as root!

Here are some links that helped me understand it better: