Customer support can be fantastically rewarding sometimes. When your combination of skill, tenacity and knowledge produces a solution that’s straightforward and effective, the feeling of satisfaction is hard to match. It doesn’t even have to be something big; we’ll take those small victories gladly. Our work sometimes looks a bit like magic, and we don’t mind one bit.
A few weeks ago one of our customers reported problems with streaming media from their server. Clients were taking about 20-30 seconds to connect, which was of course unacceptable, and they suspected something was wrong on our side, perhaps congestion or some over-zealous border firewall. The redundant connectivity we purchase is well above requirements even in the face of failure, and we don’t oversell bandwidth, so the former wasn’t a possibility. We use Linux’s netfilter framework for firewalling and routing (some people are really surprised when they learn we’re not using an enterprise-y appliance box) so the latter wasn’t likely. Thanks, Cisco, but we know plenty about security; we don’t need you to silently break our SMTP transactions for us in the name of “security”.
The server had been set up to listen on port 443 (to get around annoying corporate firewalls and the like) instead of the defaults in the config file, which looked to be ports 8083 and 33840. Not that this should be a problem: I checked the ports the server was listening on with netstat and confirmed that the firewall was letting that traffic through (the media server is a Java app, by the way).
netstat -tunlp | grep java
The customer had been in contact with the vendor of the media streaming server software for a little while, and the vendor was adamant it was a problem with our network.
One of the most useful tools in our arsenal for diagnosing network issues is tcpdump. With a rule-specification syntax that’s probably rich enough to solve a three-body problem, it’s very easy to drill down and find what you need in the flow of information. In this case I got the customer’s IP address and asked them to attempt a few connections, using a filter something like this:
tcpdump -i any host 220.127.116.11 and not tcp port ssh
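When the capture is noisy, it can help to narrow the filter to connection attempts only, which makes an unexpected destination port stand out. The following is a minimal sketch rather than the exact filter used here: the helper name syn_filter is made up for illustration, 192.0.2.10 is a placeholder documentation address standing in for the customer's IP, and the tcp[tcpflags] test only matches IPv4 traffic.

```shell
# Build a capture filter that matches only initial SYNs
# (SYN flag set, ACK flag clear) to or from a single host.
# syn_filter is a hypothetical helper name, not part of tcpdump.
syn_filter() {
  printf 'host %s and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn\n' "$1"
}

# Print the filter so it can be inspected before use.
# 192.0.2.10 is a placeholder address (RFC 5737 documentation range).
syn_filter 192.0.2.10

# To capture for real (needs root); -n skips DNS lookups:
#   tcpdump -n -i any "$(syn_filter 192.0.2.10)"
```

Each line tcpdump prints with this filter is one connection attempt, so repeated SYNs to the same port with no reply show up as an obvious pattern.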
What I immediately saw was a wave of connection attempts to port 1935. “But there’s nothing listening there”, I thought. Puzzled, I dropped the firewall and asked them to try again; it worked immediately, and they were quite happy to leave the firewall off at that point. We didn’t want to do this of course, so I asked them to persevere for a bit.
After raising the firewall again I asked the customer to connect once more and watched the packets intently. Requests were arriving on 1935, being dropped by the firewall, then retried at growing intervals, which is consistent with exponential backoff behaviour. Almost exactly 20 seconds later, a connection attempt arrived on port 443 and the customer commented that it had finally connected.
Aha! Everything suddenly clicked. The client was attempting a connection on 1935, which it turns out is a standard port used by the Flash content server. With nothing listening there, the standard procedure is to have the firewall drop all such packets and not bother replying. The client, assuming the possibility of a congested/lossy network, keeps retrying for 20 seconds, gives up, tries another port, and immediately succeeds. With the firewall down, the OS instead replies with a TCP RST to tell the client there’s nothing there, so it tries the alternate port straight away.
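The roughly 20-second stall fits a simple backoff calculation. As a sketch, assume the client waits 3 seconds before its first SYN retransmission, doubles the wait each time, and gives up after two retransmissions. Those figures are an assumption for illustration (the client's actual TCP settings weren't recorded), and backoff_schedule is a made-up helper name.

```shell
# Print when each SYN would be sent and when the client gives up,
# given an initial retransmission timeout that doubles per retry.
# The function body runs in a subshell so its variables stay local.
backoff_schedule() (
  rto=$1        # initial retransmission timeout in seconds (assumed)
  retries=$2    # retransmissions after the first SYN (assumed)
  elapsed=0
  attempt=1
  while [ "$attempt" -le $((retries + 1)) ]; do
    echo "SYN $attempt sent at ${elapsed}s"
    elapsed=$((elapsed + rto))   # wait out the current timeout
    rto=$((rto * 2))             # exponential backoff: double it
    attempt=$((attempt + 1))
  done
  echo "gives up at ${elapsed}s"
)

# Assumed values: 3 second initial timeout, 2 retransmissions.
backoff_schedule 3 2
```

With those assumed values the waits are 3 + 6 + 12 seconds, so the client abandons port 1935 after 21 seconds, close to the delay seen in the capture. A TCP RST short-circuits all of this because the client learns at once that the port is closed.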
The solution, then, was simple: ensure that the client gets a TCP RST for connections to port 1935. This is similar to what’s done for port 113 (the ident protocol that’s familiar to IRC users), which can also delay connections to the server. Filtergen doesn’t appear to have a way to specify how to REJECT, so it just uses the default ICMP port-unreachable. This should be sufficient, but testing showed that the client wasn’t receiving those ICMP messages, so in the end I just allowed connections to port 1935 through the firewall. With nothing listening there, the IP stack answered them with a RST and all was well. If you’re using raw iptables rules, something like this will do the job:
iptables -I INPUT -p tcp --dport 1935 -j REJECT --reject-with tcp-reset
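If the rule is applied from a script that may run more than once, it's worth checking whether it already exists before inserting it. A minimal sketch, assuming an iptables version that supports the -C (check) option; the reject_with_reset helper and the IPTABLES override variable are illustrative names, not part of iptables itself.

```shell
# Command to run; override for a dry run or testing
# (IPTABLES is a hypothetical variable introduced for this sketch).
IPTABLES=${IPTABLES:-iptables}

# Insert a "REJECT with TCP reset" rule for one TCP port,
# but only if an identical rule is not already in the INPUT chain.
reject_with_reset() {
  # -C exits 0 when a matching rule already exists
  "$IPTABLES" -C INPUT -p tcp --dport "$1" -j REJECT --reject-with tcp-reset >/dev/null 2>&1 ||
    "$IPTABLES" -I INPUT -p tcp --dport "$1" -j REJECT --reject-with tcp-reset
}

# Usage (as root):
#   reject_with_reset 1935
```

The same helper could be pointed at port 113 for the ident case mentioned above.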
All up this probably only took 10-15 minutes on the phone with the customer. For them, after a number of hours of fruitless grappling with vendor tech support, that’s magic.