Sleeping soundly even when scaling your app to 200K requests/second

A roundup of seemingly inconspicuous things that we had to deal with as we went from zero to 200K POST requests/second at CleverTap: engineering for reliable autoscaling (you don’t want to be caught off guard as traffic increases), optimising SSL handshakes to drive down the cost of data transfer, and managing infrastructure on AWS.

Always connected reverse SSH port forwarding with systemd

Replacing autossh, the de facto tool for managing and monitoring SSH connections, with a systemd service.

Terms used

  • remote host refers to a device running in a third-party managed network, i.e. you have no control over any of the networking equipment. Its public IP may or may not change
  • managed host refers to a server/device whose SSH port is reachable

Assumptions

  • remote host runs a distribution that uses systemd as an init system
  • A user named callhome on the remote host is able to SSH using public key authentication as incoming@managed host

systemd service configuration

Create a systemd service unit by adding the config below to a file called /etc/systemd/system/call-home.service.

[Unit]
Description=Forward local SSH port to remote host
After=network-online.target
Before=multi-user.target
DefaultDependencies=no

[Service]
# The SSH connection uses the private key stored in this
# user's home directory (~/.ssh/)
User=callhome

# SSH connection with reverse port forwarding:
# port 5000 on the managed host forwards to local port 22
ExecStart=/usr/bin/ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ServerAliveInterval=20 -o ServerAliveCountMax=1 -o ExitOnForwardFailure=yes -N -T -R5000:localhost:22 incoming@managedhost.example.com

# wait 60 seconds before trying to restart the connection
# if it disconnects
RestartSec=60

# keep retrying no matter what
Restart=always

[Install]
WantedBy=multi-user.target
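
As a rough sketch of what these settings mean in practice: according to the ssh_config documentation, an unresponsive server is dropped after roughly ServerAliveInterval × ServerAliveCountMax seconds, and systemd then waits RestartSec before running ssh again. The helper name below is hypothetical and the result is an approximation, not a guarantee.

```python
def approx_gap_seconds(alive_interval: int, alive_count_max: int, restart_sec: int) -> int:
    """Approximate time between the link dying and ssh being started again."""
    # ssh gives up after about interval * count_max seconds without a reply
    detection = alive_interval * alive_count_max
    # systemd then sleeps RestartSec before running ExecStart again
    return detection + restart_sec


# Values taken from the unit file above
print(approx_gap_seconds(alive_interval=20, alive_count_max=1, restart_sec=60))
```

With the values from the unit file, the tunnel should be retried a little over a minute after the link goes away. Lowering RestartSec shortens the gap at the cost of more frequent connection attempts.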

Ensure that this service starts at boot

root@raspberry-pi:~# systemctl enable call-home
Created symlink from /etc/systemd/system/multi-user.target.wants/call-home.service to /etc/systemd/system/call-home.service.
root@raspberry-pi:~#

Start the service and test if port forwarding works

root@raspberry-pi:~# systemctl start call-home
# check to see if the connection was established
root@raspberry-pi:~# sudo journalctl -u call-home
Jun 25 18:03:00 raspberry-pi systemd[1]: Starting SSH reverse tunnelling...
Jun 25 18:03:00 raspberry-pi systemd[1]: Started SSH reverse tunnelling.
Jun 25 18:03:01 raspberry-pi ssh[23582]: Warning: Permanently added '1.2.3.4' (ECDSA) to the list of known hosts.

If everything worked, you should be able to connect to port 5000 on the managed host, authenticate and reach the remote host, like so:

root@ip-172-31-20-1:~# ssh -p 5000 pi@127.0.0.1
pi@127.0.0.1's password:

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Sat Jun 25 20:17:28 2016 from localhost
pi@raspberry-pi:~ $

Troubleshooting

When attempting to connect from the managed host to the remote host, you get the error ssh: connect to host 127.0.0.1 port 5000: Connection refused

  • On the remote host check if forwarding worked, like so:
pi@raspberry-pi:~ $ sudo journalctl -u call-home
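
On the managed host, a quick way to tell "nothing is listening on port 5000" apart from an authentication problem is to attempt a plain TCP connection. This is a minimal sketch using only the Python standard library; the function name is made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts
        return False


if __name__ == "__main__":
    # Run on the managed host: 5000 is the forwarded port from the unit file
    print("tunnel up" if port_is_open("127.0.0.1", 5000) else "tunnel down")
```

If this reports the tunnel as down, the problem is on the forwarding side, so the journal on the remote host is the place to look.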

Reverse proxying Kibana with Nginx at a subpath (/kibana)

Reverse proxying to Kibana when it’s hosted at the root path (/), i.e. https://kibana.tools.example.org/, works out of the box. The problem is that most folks don’t have a dedicated certificate for each internal application. Instead, the common practice is to host apps under a subpath such as https://tools.example.org/kibana/. This is where things get tricky. Kibana supports a config variable called server.basePath that is supposed to set its base path so that all emitted links are prefixed accordingly, which should make Kibana play nice behind a reverse proxy. As of today, the latest version still has issues (#5171, #1555 and #6339).

Assuming you have Kibana running on the default port 5601, server.basePath set to “” and Nginx configured to respond to hostname tools.example.org, the following Nginx configuration makes Kibana play nice at subpaths /kibana and /kibana/:

-- snip --

# Kibana is actually served at /app/kibana.
# These redirects point clients in the right direction.
location = /kibana {
    return 301 https://tools.example.org/app/kibana;
}

location = /kibana/ {
    return 301 https://tools.example.org/app/kibana;
}

# by default Kibana redirects /app/kibana/ to /app/kibana
location = /app/kibana/ {
    return 301 https://tools.example.org/app/kibana;
}

# this is where the app is served
location = /app/kibana {
    proxy_pass http://kibana-host:5601;
}

# internal application links
location /app/kibana/ {
    proxy_pass http://kibana-host:5601;
}

# static content is not relative to /app/kibana;
# instead it is served at /bundles/*
# see https://github.com/elastic/kibana/issues/6339
location /bundles/ {
    proxy_pass http://kibana-host:5601;
}

-- snip --
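
The configuration above leans on how nginx chooses a location: an exact (=) match wins outright, otherwise the longest matching prefix is used. Below is a rough Python model of that selection, limited to the two location types used here (regex locations are ignored). The table and function names are illustrative, and the bundle file name is made up.

```python
# (modifier, uri, action) tuples mirroring the location blocks above
LOCATIONS = [
    ("=", "/kibana", "redirect"),
    ("=", "/kibana/", "redirect"),
    ("=", "/app/kibana/", "redirect"),
    ("=", "/app/kibana", "proxy"),
    ("", "/app/kibana/", "proxy"),
    ("", "/bundles/", "proxy"),
]


def pick_location(path):
    """Return the action chosen for path, or None if no location matches."""
    # Exact-match locations are checked first and win outright
    for modifier, uri, action in LOCATIONS:
        if modifier == "=" and path == uri:
            return action
    # Otherwise the longest matching prefix location is used
    prefixes = [(uri, action) for modifier, uri, action in LOCATIONS
                if modifier == "" and path.startswith(uri)]
    if not prefixes:
        return None
    return max(prefixes, key=lambda item: len(item[0]))[1]


for path in ("/kibana", "/app/kibana", "/app/kibana/",
             "/bundles/example.bundle.js", "/other"):
    print(path, "->", pick_location(path))
```

Note that /app/kibana/ hits the exact-match redirect rather than the prefix proxy block, which is what keeps the trailing-slash variant from being proxied.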

Logstash mysteriously returns connection reset when connecting to a HTTPS elasticsearch endpoint

I was on a wild goose chase today because connections from Logstash to a newly set up Elasticsearch instance fronted by nginx were failing with the error ‘Connection reset’. I was absolutely certain that the host could connect: curl -v https://example.com/elasticsearch worked. For some reason, Logstash could not connect. I assumed I wasn’t setting some elasticsearch plugin parameters correctly.

[root@ip-172-10-1-246 conf.d]# /etc/init.d/logstash configtest
Mar 21, 2016 4:42:31 PM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://es.example.com:443: Connection reset
Mar 21, 2016 4:42:31 PM org.apache.http.impl.execchain.RetryExec execute
INFO: Retrying request to {s}->https://es.example.com:443
Mar 21, 2016 4:42:31 PM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://es.example.com:443: Connection reset
Mar 21, 2016 4:42:31 PM org.apache.http.impl.execchain.RetryExec execute
INFO: Retrying request to {s}->https://es.example.com:443
Mar 21, 2016 4:42:31 PM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://es.example.com:443: Connection reset
Mar 21, 2016 4:42:31 PM org.apache.http.impl.execchain.RetryExec execute
INFO: Retrying request to {s}->https://es.example.com:443
Connection reset {:class=>"Manticore::SocketException", :level=>:error}
Configuration OK

After hours of trial and elimination, I had it narrowed down to the JVM that was running Logstash. It turns out that support for TLSv1.2 in OpenJDK 1.7 is not enabled by default. Adding -Dhttps.protocols=TLSv1.2 to the Java startup parameters does not help either. Upgrading to OpenJDK 1.8 worked for me. I hope this quick note saves someone some time.
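
To make the failure mode concrete, here is a toy model of TLS version negotiation: the handshake can only succeed if client and server have at least one enabled version in common. The protocol lists are assumptions for illustration. The post does not say how nginx was configured, so a TLSv1.2-only server is assumed, and the client lists are simplified stand-ins for the JVM defaults described above.

```python
# Protocol versions ordered from oldest to newest
ORDER = ["SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2"]


def negotiate(client_enabled, server_enabled):
    """Return the newest version both sides enable, or None if there is none."""
    common = set(client_enabled) & set(server_enabled)
    if not common:
        # No shared version: the handshake cannot complete
        return None
    return max(common, key=ORDER.index)


# Assumption: the nginx front end only accepts TLSv1.2
server = ["TLSv1.2"]
# Simplified client defaults: 1.7 leaves TLSv1.2 disabled, 1.8 enables it
openjdk7_client = ["SSLv3", "TLSv1"]
openjdk8_client = ["TLSv1", "TLSv1.1", "TLSv1.2"]

print("OpenJDK 1.7:", negotiate(openjdk7_client, server))
print("OpenJDK 1.8:", negotiate(openjdk8_client, server))
```

This would also explain why curl succeeded from the same host: it uses its own TLS library rather than the JVM's defaults.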

Debian on a Linksys WRT1200ac / WRT1900ac WiFi router

In line with their popular, hacker-friendly WRT54 series routers, Linksys released the WRT1200ac and WRT1900ac in 2014. These new devices are beefed-up versions of their predecessors. The WRT1900ac, for instance, ships with 128MB of flash storage and 256MB of DDR3 RAM, powered by an ARM-compatible Marvell Armada 370/XP SoC. Marvell has, over time, worked with the community to provide open source WiFi drivers. While work continues on the driver, the device is stable enough to run production workloads: I have ~40 devices concurrently connected during business hours on 2.4GHz and 5GHz. This top-of-the-line embedded hardware opens up an interesting new possibility. Say hello to McDebian.

McDebian is a complete Debian operating system for the new Linksys WRT routers. The kernel along with hardware specific DTB blob is written to MTD flash which enables the device to boot. Rootfs is stored on a USB key connected to the device. Suddenly, storage space is no longer a limitation.

Why McDebian on a router?

  • Debian maintains a wide range of packages.
  • systemd for init
  • A familiar networking stack running on the router
  • A familiar root filesystem. No difference from what runs on your servers every day.
  • Upgrading drivers and packages is simple and straightforward
  • Easy to create consistent backups that will save the day if things do go bad
  • Chef or Puppet for configuration management.
  • apt-get update; apt-get upgrade and – Poof! All your security updates applied.

It boots fast too

Don’t take my word for it. Here’s the evidence:

root@MCDEBIAN:~# systemd-analyze time
Startup finished in 6.891s (kernel) + 13.690s (userspace) = 20.581s

Deploying WordPress with SELinux enabled

SELinux can be a pain to work with at times, but that does not justify setting it to permissive mode or disabling it. WordPress’ popularity makes it a favorite target for script kiddies. Besides always keeping your WordPress instance updated, you should run httpd/Apache in a confined SELinux domain. This limits the damage someone can do if they ever manage to upload and execute files on your web server (ever noticed those randomly named, hidden binary files in /tmp owned by the user running your web server?).

Assumptions

  • SELinux is installed and enabled
  • WordPress is unzipped into /var/www/html/
  • This has been tested on an Amazon AMI, but should work for all distributions that support SELinux

Getting started

Let’s make sure that the stage is set up correctly.

Ensuring SELinux is up and running in enforcing mode

sestatus must report the output below. If it doesn’t, SELinux is either disabled or running in permissive mode. Don’t proceed until your output looks exactly like this:

[ec2-user@ip-172-30-10-10 ~]$ sestatus  | grep 'SELinux status\|Current mode'
SELinux status:                 enabled
Current mode:                   enforcing

Ensuring httpd is running in confined domain httpd_t

httpd must be running in domain httpd_t. If it’s running in unconfined_t, it’s in the wrong domain and the rules meant to secure it won’t apply.

[ec2-user@ip-172-30-10-10  ~]$ ps uax -Z | grep httpd
unconfined_u:system_r:httpd_t:s0 root    22884  0.0  1.0 324744 11188 ?      Ss   Jan15   0:01 /usr/sbin/httpd
unconfined_u:system_r:httpd_t:s0 apache  22887  0.0  2.7 342656 28556 ?      S    Jan15   0:00 /usr/sbin/httpd
unconfined_u:system_r:httpd_t:s0 apache  23138  0.0  0.6 324876  6812 ?      S    Jan15   0:00 /usr/sbin/httpd
unconfined_u:system_r:httpd_t:s0 apache  23206  0.0  0.5 324744  6064 ?      S    Jan15   0:00 /usr/sbin/httpd
unconfined_u:system_r:httpd_t:s0 apache  23483  0.0  0.5 324744  6064 ?      S    Jan15   0:00 /usr/sbin/httpd
unconfined_u:system_r:httpd_t:s0 apache  23486  0.0  0.5 324744  6064 ?      S    Jan15   0:00 /usr/sbin/httpd
unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 ec2-user 24944 0.0  0.0 110256 644 pts/0 S+ 03:58   0:00 grep httpd

If your /usr/sbin/httpd is running in domain unconfined_t instead of httpd_t, then the context of /etc/init.d/httpd or /usr/sbin/httpd has somehow changed. Restoring the context should fix it.

To restore the context, run restorecon, then restart httpd so that it starts in the right domain:

[ec2-user@ip-172-30-10-10 ~]$ sudo restorecon -v  /etc/init.d/httpd /usr/sbin/httpd
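
The domain check can also be automated. The sketch below parses the text produced by the ps command shown earlier and reports any httpd process whose SELinux type is not httpd_t. It assumes the twelve-column layout seen in that output, and the function name is hypothetical.

```python
def unconfined_httpd(ps_output: str):
    """Return PIDs of httpd processes whose SELinux type is not httpd_t."""
    offenders = []
    for line in ps_output.splitlines():
        # Columns: context user pid %cpu %mem vsz rss tty stat start time command
        fields = line.split(None, 11)
        if len(fields) < 12:
            continue
        executable = fields[11].split()[0]
        if not executable.endswith("/httpd"):
            continue
        # A context looks like user:role:type:level
        selinux_type = fields[0].split(":")[2]
        if selinux_type != "httpd_t":
            offenders.append(int(fields[2]))
    return offenders


sample = (
    "unconfined_u:system_r:httpd_t:s0 root 22884 0.0 1.0 324744 11188 ? Ss Jan15 0:01 /usr/sbin/httpd\n"
    "unconfined_u:unconfined_r:unconfined_t:s0 apache 23000 0.0 0.5 324744 6064 ? S Jan15 0:00 /usr/sbin/httpd\n"
)
print(unconfined_httpd(sample))
```

The second sample line is invented to show a failing case. Against the real output above, the list would be empty.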

Allowing httpd to connect to a remote database

If you are hosting your database on a remote server, httpd must be allowed to connect to it.

[ec2-user@ip-172-30-10-10 ~]$ sudo setsebool -P  httpd_can_network_connect_db 1

Note: Connection to local database i.e. localhost:3306 does not need this to be set to true

Whitelisting /var/www/html/wp-content/uploads/ for write access

When uploading images and other media from wp-admin, httpd needs write access to WordPress’ uploads directory. By default, all files and directories in /var/www/html are labeled with type httpd_sys_content_t, which the httpd process running in domain httpd_t cannot write to. Changing the type to httpd_sys_rw_content_t allows write/create/delete access.

[ec2-user@ip-172-30-10-10 ~]$ sudo semanage   fcontext -f ""  -a -t  httpd_sys_rw_content_t '/var/www/html/wp-content/uploads(/.*)?'

To apply the change in context for directories and files in /var/www/html/wp-content/uploads/* use restorecon

[ec2-user@ip-172-30-10-10 ~]$ sudo restorecon -Rv  /var/www/html
[ec2-user@ip-172-30-10-10 ~]$ ls -all -Z /var/www/html/wp-content/uploads/
drwxr-xr-x. apache apache unconfined_u:object_r:httpd_sys_rw_content_t:s0 .
drwxr-xr-x. nobody  65534 unconfined_u:object_r:httpd_sys_content_t:s0 ..
drwxr-xr-x. apache apache unconfined_u:object_r:httpd_sys_rw_content_t:s0 2015
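
File-context patterns are regular expressions matched against the whole path, which is why the trailing (/.*)? matters: it covers the directory itself as well as everything beneath it. The sketch below uses Python's re module to show which paths the rule would cover. It ignores any other, more specific rules in the policy, and the example paths are made up.

```python
import re

# Same pattern that was passed to semanage fcontext
PATTERN = r"/var/www/html/wp-content/uploads(/.*)?"


def gets_rw_label(path: str) -> bool:
    """True if this rule would label path as httpd_sys_rw_content_t."""
    # The pattern must match the entire path, not just a prefix
    return re.fullmatch(PATTERN, path) is not None


for path in (
    "/var/www/html/wp-content/uploads",
    "/var/www/html/wp-content/uploads/2015/photo.jpg",
    "/var/www/html/wp-content/uploads-old",
    "/var/www/html/wp-config.php",
):
    print(path, gets_rw_label(path))
```

Only the first two paths are covered. Sibling directories and the rest of the DocRoot keep the read-only label.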

Special note about boolean httpd_unified on CentOS/RHEL distributions

By default, httpd_unified is enabled on CentOS/RHEL systems older than version 7. This allows Apache write access to files and directories labeled with context httpd_sys_content_t. We want to make sure that Apache can only write to the directory and files we whitelisted (/var/www/html/wp-content/uploads/). Turning it off is highly recommended to control writes to the filesystem, including within the DocRoot.

[ec2-user@ip-172-30-10-10 ~]$ sudo setsebool -P  httpd_unified  0

Debugging

  • When working with SELinux, tailing /var/log/audit/audit.log is always helpful
  • Temporarily switch to permissive mode and check whether whatever breaks in enforcing mode now works. This tells you if SELinux is the cause.

Exploring other booleans documented in httpd_selinux

All booleans that apply to httpd can be listed using getsebool. Each one is explained in the httpd_selinux man page.

[ec2-user@ip-172-30-10-10 ~]$ sudo getsebool -a | grep httpd
allow_httpd_anon_write --> off
allow_httpd_mod_auth_ntlm_winbind --> off
allow_httpd_mod_auth_pam --> off
allow_httpd_sys_script_anon_write --> off
httpd_builtin_scripting --> on
httpd_can_check_spam --> off
httpd_can_connect_ftp --> off
httpd_can_connect_ldap --> off
httpd_can_connect_zabbix --> off
httpd_can_network_connect --> on
httpd_can_network_connect_cobbler --> off
httpd_can_network_connect_db --> off
httpd_can_network_memcache --> off
httpd_can_network_relay --> off
httpd_can_sendmail --> off
httpd_dbus_avahi --> off
httpd_enable_cgi --> on
httpd_enable_ftp_server --> off
httpd_enable_homedirs --> off
httpd_execmem --> off
httpd_graceful_shutdown --> off
httpd_manage_ipa --> off
httpd_read_user_content --> off
httpd_run_stickshift --> off
httpd_setrlimit --> off
httpd_ssi_exec --> off
httpd_tmp_exec --> off
httpd_tty_comm --> off
httpd_unified --> off
httpd_use_cifs --> off
httpd_use_fusefs --> off
httpd_use_gpg --> off
httpd_use_nfs --> off
httpd_verify_dns --> off
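
A listing like this is easy to check mechanically. The sketch below parses getsebool output into a dictionary and reports any boolean that differs from what you expect. The function names and the expected values are illustrative, chosen to mirror the settings discussed in this post for a remote-database setup.

```python
def parse_sebool(output: str) -> dict:
    """Turn `getsebool -a` style output into {name: bool}."""
    booleans = {}
    for line in output.splitlines():
        name, sep, state = line.partition(" --> ")
        if sep:
            booleans[name.strip()] = state.strip() == "on"
    return booleans


def mismatches(booleans: dict, expected: dict) -> list:
    """Return the names whose current value differs from the expected one."""
    return [name for name, want in expected.items() if booleans.get(name) != want]


listing = """httpd_can_network_connect_db --> off
httpd_unified --> off
"""
expected = {"httpd_unified": False, "httpd_can_network_connect_db": True}
print(mismatches(parse_sebool(listing), expected))
```

Here the two-line listing is an excerpt of the output above. It flags httpd_can_network_connect_db, which is still off and would need to be enabled for a remote database.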

Putting it all together

At this point, you should have a working instance of WordPress served by httpd running in the confined domain httpd_t. This should minimise the damage that someone can do if they ever manage to upload files to your server and attempt to execute them.

These simple steps should keep your WordPress instance fairly secure and the random binary files in /tmp at bay.

Further reading

  • SELinux in Practice: DVWA Test by Positive Research Center
  • And in response – Got SELinux? by Dan Walsh

Setting up an IPSec VPN connection to Microsoft Azure using Strongswan

It took me a while to get the IPSec tunnel between Azure and Strongswan up and running. This post documents the Strongswan configuration required to get traffic going through the tunnel.

Assumptions

  • Private network segment on Azure’s side is 10.0.0.0/16
  • Public IP address of VPN gateway on Azure’s side is 1.2.3.4
  • Private network segment of instance running Strongswan is 172.30.0.0/16
  • IP address of instance running Strongswan is 172.30.2.11
  • Your pre-shared key is in /etc/strongswan/ipsec.secrets
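
Before bringing a tunnel up, it can be worth checking that the two private segments do not overlap and that the Strongswan instance really sits inside the local one. This is a small sanity check using Python's ipaddress module with the addresses from the assumptions above. The function name is made up for illustration.

```python
import ipaddress


def addressing_problems(local_net, remote_net, instance_ip):
    """Return a list of problems; an empty list means the addressing looks sane."""
    local = ipaddress.ip_network(local_net)
    remote = ipaddress.ip_network(remote_net)
    instance = ipaddress.ip_address(instance_ip)
    problems = []
    # Overlapping segments make routing through the tunnel ambiguous
    if local.overlaps(remote):
        problems.append(f"{local} overlaps {remote}")
    if instance not in local:
        problems.append(f"{instance} is not inside {local}")
    return problems


print(addressing_problems("172.30.0.0/16", "10.0.0.0/16", "172.30.2.11"))
```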

Connection configuration

[francis@ip-172-30-2-11 ~]# cat /etc/strongswan/ipsec.conf
conn office-network-to-azure-southeast-asia
        closeaction=restart
        dpdaction=restart
        ike=aes256-sha1-modp1024
        esp=aes256-sha1
        reauth=no
        keyexchange=ikev2
        mobike=no
        ikelifetime=28800s
        keylife=3600s
        keyingtries=%forever
        authby=secret
        left=172.30.2.11             # local instance ip (strongswan)
        leftsubnet=0.0.0.0/0
        leftid=172.30.2.11           # local instance ip (strongswan)
        right=1.2.3.4          # vpn gateway ip (azure)
        rightid=1.2.3.4        # vpn gateway ip (azure)
        rightsubnet=10.0.0.0/16      # private ip segment (azure)
        auto=start

Installing MySQL 5.5 on CentOS 6.x

This article describes how to install MySQL 5.5, which is not available in the default CentOS package repository, on CentOS 6.x. It installs the 64-bit (x86_64) build of MySQL 5.5.33-1 on an x86_64 machine. For i386, replace x86_64 with i386.

# Install libaio, which is required by MySQL server 5.5
$ yum install libaio

# Download MySQL 5.5 installation RPMs
$ wget http://dev.mysql.com/get/Downloads/MySQL-5.5/MySQL-5.5.33-1.linux2.6.x86_64.rpm-bundle.tar/from/http://cdn.mysql.com/

# Untar the installation bundle
$ tar -xvf MySQL-5.5.33-1.linux2.6.x86_64.rpm-bundle.tar

# Install MySQL shared compat
$ rpm -Uvh MySQL-shared-compat-5.5.33-1.linux2.6.x86_64.rpm

# Install MySQL shared
$ rpm -Uvh MySQL-shared-5.5.33-1.linux2.6.x86_64.rpm

# Install MySQL client
$ rpm -Uvh MySQL-client-5.5.33-1.linux2.6.x86_64.rpm

# Install MySQL server
$ rpm -Uvh MySQL-server-5.5.33-1.linux2.6.x86_64.rpm

# Finally, start MySQL server 5.5
$ /etc/init.d/mysql start

Don’t forget to run mysql_secure_installation to secure the newly installed MySQL instance.

Kannel – setting up active-passive failover SMPP gateways

This post documents setting up two SMPP gateways in active-passive failover mode, i.e. when the active SMPP gateway goes down, traffic is automatically sent through the secondary (passive) SMPP gateway.

Prerequisites

This post assumes that you have working knowledge of setting up and running Kannel. Only the relevant config is documented; the rest is snipped for readability.

Setting up active-passive gateways

Assuming the smsc-id of the active SMPP gateway is reliable_smpp_gw and that of the passive one is unreliable_smpp_gw, the following config sets up an active-passive gateway.

Configuring the active SMPP gateway

group = smsc
smsc = smpp
smsc-id = reliable_smpp_gw
.
.
allowed-smsc-id = reliable_smpp_gw
preferred-smsc-id = reliable_smpp_gw

Configuring the passive SMPP gateway

group = smsc
smsc = smpp
smsc-id = unreliable_smpp_gw
.
.
allowed-smsc-id = unreliable_smpp_gw;reliable_smpp_gw
preferred-smsc-id = unreliable_smpp_gw

Hooking the gateways to an account

group = sendsms-user
username = company_1
password = compant_1_admin
name = company_1
.
.
default-smsc = reliable_smpp_gw
forced-smsc = reliable_smpp_gw

The default-smsc and forced-smsc settings ensure that messages submitted by user account company_1 are sent through smsc reliable_smpp_gw. The passive gateway is configured to accept messages for both smsc-id reliable_smpp_gw and unreliable_smpp_gw, which makes Kannel use the passive gateway (unreliable_smpp_gw) when the active one (reliable_smpp_gw) is unavailable. Once the active gateway (reliable_smpp_gw) is back online, traffic is automatically sent through it again.
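
The failover behaviour can be pictured with a small model. This is a simplified sketch of the routing idea, not Kannel's actual algorithm: a message tagged with an smsc-id goes to an online gateway that prefers that id, and falls back to any online gateway that merely allows it. The ids are spelled reliable_smpp_gw and unreliable_smpp_gw throughout, and the data structure is purely illustrative.

```python
GATEWAYS = [
    {"id": "reliable_smpp_gw",
     "allowed": {"reliable_smpp_gw"},
     "preferred": {"reliable_smpp_gw"}},
    {"id": "unreliable_smpp_gw",
     "allowed": {"unreliable_smpp_gw", "reliable_smpp_gw"},
     "preferred": {"unreliable_smpp_gw"}},
]


def route(message_smsc_id, online):
    """Pick a gateway id for the message, or None if no gateway can take it."""
    # Only gateways that are connected and allow this smsc-id are candidates
    candidates = [g for g in GATEWAYS
                  if g["id"] in online and message_smsc_id in g["allowed"]]
    # Gateways that prefer this smsc-id win over ones that merely allow it
    preferred = [g for g in candidates if message_smsc_id in g["preferred"]]
    chosen = preferred or candidates
    return chosen[0]["id"] if chosen else None


both = {"reliable_smpp_gw", "unreliable_smpp_gw"}
print(route("reliable_smpp_gw", both))
print(route("reliable_smpp_gw", {"unreliable_smpp_gw"}))
```

With both gateways online the active one is chosen. With only the passive one online, the same message falls through to it.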

Kannel – setting up active-active load-balanced SMPP gateways

This post documents setting up Kannel to balance across two SMPP gateways in active-active mode such that messages are sent using both gateways.

Note: Kannel does not monitor the quality of service of each link. If a gateway is connected and is not delivering messages, it will continue using the gateway until the connection to the gateway goes offline.

Prerequisites

This post assumes that you have working knowledge of setting up and running Kannel. Only the relevant config is documented; the rest is snipped for readability.

Setting up active-active gateways

The trick to setting up active-active gateways in Kannel is to set the same smsc-id for both SMPP gateways.

[SMPP connection 1 config]
group = smsc
smsc = smpp
smsc-id = smpp_carrier_gw
host = carrier1.example.com
smsc-username = carrier1
smsc-password = carrier1

[SMPP connection 2 config]
group = smsc
smsc = smpp
smsc-id = smpp_carrier_gw
host = carrier2.example.com
smsc-username = carrier2
smsc-password = carrier2

Hooking the gateways to an account

group = sendsms-user
username = company_1
password = compant_1_admin
name = company_1
.
.
default-smsc = smpp_carrier_gw
forced-smsc = smpp_carrier_gw
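
To exercise the account, a message can be submitted through Kannel's sendsms HTTP interface. The sketch below only builds the request URL. The host, the port 13013 and the recipient number are assumptions for illustration (the real port is whatever sendsms-port is set to in your smsbox group), while the credentials are the ones from the sendsms-user group above.

```python
from urllib.parse import urlencode


def sendsms_url(host, port, username, password, to, text):
    """Build a sendsms request URL for the given account and message."""
    # urlencode takes care of escaping '+' in numbers and spaces in the text
    query = urlencode({
        "username": username,
        "password": password,
        "to": to,
        "text": text,
    })
    return f"http://{host}:{port}/cgi-bin/sendsms?{query}"


# Hypothetical smsbox address and recipient
print(sendsms_url("localhost", 13013, "company_1", "compant_1_admin",
                  "+15550100", "hello world"))
```

Because forced-smsc is set on the account, messages submitted this way should be routed via smpp_carrier_gw, and therefore across both carrier connections.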