This tutorial sets up a router that splits traffic fairly evenly across two internet connections while keeping each connection's packets on a consistent route. It takes a fair amount of time on the user's end, but the result is very efficient for the circumstances it was built around.
The approach relies on connection tracking: when a new conversation starts, its packets are marked with an identifier, and the IP routing rules can be set up to key off that mark. In a nutshell, when you establish a new connection and a particular route is selected for it, all the other packets in that conversation are tagged with the same identifier, so they take the same path. Essentially it gives the same behaviour as the old route caching from the '00s.
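If you want to see those marks for yourself later on, the conntrack tool can list tracked connections filtered by mark. A quick sketch, assuming the conntrack-tools package is installed (the marks 1 and 2 are set up in the next section):

conntrack -L --mark 1    # tracked connections routed via the first WAN
conntrack -L --mark 2    # tracked connections routed via the second WAN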
iptables
First of all we need to create the iptables configuration to set up connection marking. Here’s the relevant extract from the iptables.save file:
*mangle
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:INPUT ACCEPT [0:0]
[0:0] -A PREROUTING -i eth1 -j CONNMARK --restore-mark
[0:0] -A PREROUTING -i ppp1 -m conntrack --ctstate NEW -j CONNMARK --set-mark 1
[0:0] -A PREROUTING -i eth0 -m conntrack --ctstate NEW -j CONNMARK --set-mark 2
[0:0] -A POSTROUTING -o ppp1 -m conntrack --ctstate NEW -j CONNMARK --set-mark 1
[0:0] -A POSTROUTING -o eth0 -m conntrack --ctstate NEW -j CONNMARK --set-mark 2
-i = --in-interface and -o = --out-interface
These rules set a mark depending on which interface is used. These changes happen in the mangle table.
Packets going in or out of the WAN via ppp1 or eth0 that belong to a new connection are marked with a 1 or a 2, depending on which interface they use. The decision about which route to use is made in the ip rules, which we will see later. Any packets coming in on eth1, i.e. from the LAN, have their marks restored on the way in so they can be dealt with accordingly.
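A quick way to confirm the marking behaves as described is to watch the per-rule packet counters in the mangle table; for example:

iptables -t mangle -L PREROUTING -v -n    # counters should climb on the CONNMARK rules as new connections arrive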
Now let’s have a look at the filter table:
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
:LAN_WAN - [0:0]
:WAN_LAN - [0:0]
[0:0] -A INPUT -i lo -j ACCEPT
[0:0] -A INPUT -i eth1 -j ACCEPT
[0:0] -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[0:0] -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[0:0] -A FORWARD -i eth1 -o ppp1 -j LAN_WAN
[0:0] -A FORWARD -i eth1 -o eth0 -j LAN_WAN
[0:0] -A FORWARD -i ppp1 -o eth1 -j WAN_LAN
[0:0] -A FORWARD -i eth0 -o eth1 -j WAN_LAN
## Clamp MSS (ideal for PPPoE connections)
[0:0] -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
[0:0] -A LAN_WAN -j ACCEPT
[0:0] -A WAN_LAN -j REJECT
The default policy is set to DROP, so any packet not matching one of the rules is dropped.
INPUT applies to packets which are bound for the router itself. Packets from the local interface are allowed, and packets from eth1 (the main LAN) are also allowed.
FORWARD applies to packets which are passing through the router on their way somewhere else. Packets which are known to be part of an already in-progress session are allowed. Packets are then categorised as LAN-to-WAN or WAN-to-LAN and handled by the LAN_WAN or WAN_LAN chains, which accept and reject them respectively. It all boils down to this: LAN clients using the Raspberry Pi as a router have their packets forwarded out, while packets coming in from the internet are rejected unless they are part of an ongoing connection.
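The per-chain counters are a handy way to watch this policy in action; a couple of illustrative checks:

iptables -L FORWARD -v -n    # traffic being steered into LAN_WAN and WAN_LAN
iptables -L WAN_LAN -v -n    # unsolicited inbound traffic being rejected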
Clamping MSS to the path MTU deals with a particular issue with PPPoE connections, where the MTU can't be the usual 1500 bytes. Because a lot of ISPs block the ICMP messages that would normally ask the client to send smaller packets, this handy trick makes sure that packets can go out unfragmented. If you find that some web pages are slow to load and others are not, try switching this on. If you're only using upstream ISP-provided routers you probably don't need this.
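To check whether your line is affected, you can probe the path MTU by hand with ping's don't-fragment flag. A sketch (8.8.8.8 is just an arbitrary reachable host, not part of this setup):

ping -c1 -M do -s 1472 8.8.8.8    # 1472 + 28 bytes of headers = 1500; fails if the path MTU is smaller
ping -c1 -M do -s 1444 8.8.8.8    # 1444 + 28 = 1472, safely under a typical PPPoE MTU of 1492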
Lastly in iptables we enable SNAT, or masquerading, so that connections out to the internet appear to come from a valid routable IP address rather than our LAN IP address:
#SNAT: LAN --> WAN
[0:0] -A POSTROUTING -o ppp1 -j SNAT --to-source 212.159.20.70
[0:0] -A POSTROUTING -o eth0 -j SNAT --to-source 192.168.1.253
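If your PPP address is dynamic rather than fixed, MASQUERADE is a common alternative to SNAT here, since it looks up the outgoing address each time instead of hard-coding it; a sketch:

[0:0] -A POSTROUTING -o ppp1 -j MASQUERADE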
Routing tables
We’ve configured iptables to add a mark to traffic depending on which WAN interface it is going in or out of. But this is only marking the packets, there is no logic to make sure that packets of the same mark use the same route. To make this happen we use ip rules.
First create three new routing tables by editing /etc/iproute2/rt_tables. I’ve added this to the bottom:
1 plusnet
2 talktalk
3 loadbal
Now we add a default route to the first two of those tables:
ip route add default via $PPP_GATEWAY_ADDRESS dev ppp1 src 212.159.20.70 table plusnet
ip route add default via 192.168.1.254 dev eth0 src 192.168.1.253 table talktalk
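It's worth checking that each table looks right before going further; the output should resemble the routes just added:

ip route show table plusnet     # default via <gateway> dev ppp1 src 212.159.20.70
ip route show table talktalk    # default via 192.168.1.254 dev eth0 src 192.168.1.253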
$PPP_GATEWAY_ADDRESS is set when the PPP session is established, and it changes each time. We can look at ways to find that address later, but for now just substitute the “P-t-P” IP address from “ifconfig ppp1” (or whatever your ppp interface number is), or, in the case of an ISP-provided router, the LAN-side IP of that router.
This is simply creating a routing table with the name of the ISP that will be used and a default route which can find its way to the internet for that ISP.
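If you'd rather script that lookup than read it off ifconfig, one possibility is to pull the peer address out of the ip command's output. A sketch, assuming the usual inet/peer layout that ip prints for point-to-point interfaces:

PPP_GATEWAY_ADDRESS=$(ip -4 addr show ppp1 | awk '/peer/ {print $4}' | cut -d/ -f1)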
Next we create the loadbal routing table which is a combination of the previous two:
ip route add default table loadbal nexthop via $PPP_GATEWAY_ADDRESS dev ppp1 nexthop via 192.168.1.254 dev eth0
This is the same idea as we used in the old route-caching days: a round-robin route which flicks between the two available routes to the internet.
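If one line is noticeably faster than the other, nexthop weights can bias the split instead of leaving it even; a sketch:

ip route replace default table loadbal nexthop via $PPP_GATEWAY_ADDRESS dev ppp1 weight 2 nexthop via 192.168.1.254 dev eth0 weight 1    # roughly 2:1 in favour of ppp1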
ip rules
We’ve now created the iptables entries to track and mark traffic from each of the two ISPs and add some basic firewalling and IP masquerading. We’ve also created a routing table for each ISP and a load-balancing table which splits the traffic between the two ISPs.
Now we need to create some rules to govern which of the routing tables is used for a particular connection. The commands to do this are:
ip rule add from $PPP_IPADDR table plusnet pref 40000
ip rule add from 192.168.1.253 table talktalk pref 40100
ip rule add fwmark 0x1 table plusnet pref 40200
ip rule add fwmark 0x2 table talktalk pref 40300
ip rule add from 0/0 table loadbal pref 40400
The rules are matched in numerical order based on preference, and once a rule matches, that's it. The first two rules make sure that traffic from the routers themselves uses the correct table.
The important rules are the last three. Traffic which has been marked “1” will always use the plusnet routing table, traffic marked as “2” will always use the talktalk routing table. This ensures that all traffic which is part of an on-going conversation will always use the same router out to the internet, and so always come from the same IP address.
The last rule only matches traffic which is not already marked, i.e. new conversations. This routing table, as can be seen in the previous section, has a multi-path route to balance traffic between the two routes out. Once a conversation is established, the iptables conntrack rules will mark the traffic and so one of the two fwmark rules will match.
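Once the rules are in place, ip rule show should list them in preference order, something like this (the source addresses will match your own setup):

ip rule show
# 0:     from all lookup local
# 40000: from 212.159.20.70 lookup plusnet
# 40100: from 192.168.1.253 lookup talktalk
# 40200: from all fwmark 0x1 lookup plusnet
# 40300: from all fwmark 0x2 lookup talktalk
# 40400: from all lookup loadbal
# 32766: from all lookup main
# 32767: from all lookup default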
Now delete the main default route so that the above rules don’t get bypassed with a route in the “main” table:
ip route del default
And that’s it. You should now have a router which splits the traffic fairly evenly across two internet connections and keeps tabs on which packets should go out of which routers. I’ve had this running for a month or so now, and it seems to be working fine. I’ve had the Pi lock up a couple of times, but I think that’s related to the USB gigabit ethernet adapter.
Spreading interrupts across cores
Network cards have queues for tx and rx. Higher end cards will typically have more queues, but on the Pi the on-board NIC (which is actually connected via USB) has one for tx and one for rx, as do the VLAN interfaces and the PPP interfaces. Each of these queues has a CPU affinity and it seems that by default the queues all use the same CPU core.
When downloading an ISO over BitTorrent with the load-balancing set up, I was able to achieve just over 10 MBytes a second, but the Pi became really unresponsive. Looking at top showed one CPU core maxed out in soft interrupts.
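The same picture is visible in the per-CPU softirq counters; for example:

grep NET_RX /proc/softirqs                   # receive softirq counts, one column per CPU
watch -d -n1 'grep NET_RX /proc/softirqs'    # watch them climb on a single core under load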
By adjusting the CPU affinity to spread these IRQs across multiple CPUs I squeezed out a tiny bit more network throughput, but more usefully the Pi remained responsive under heavy load.
The commands I used to do this are:
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo 2 > /sys/class/net/eth1/queues/tx-0/xps_cpus
echo 2 > /sys/class/net/eth1/queues/rx-0/rps_cpus
echo 4 > /sys/class/net/eth1.1000/queues/tx-0/xps_cpus
echo 4 > /sys/class/net/eth1.1000/queues/rx-0/rps_cpus
echo 8 > /sys/class/net/ppp1/queues/tx-0/xps_cpus
echo 8 > /sys/class/net/ppp1/queues/rx-0/rps_cpus
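The values written are hexadecimal CPU bitmasks (1 = CPU0, 2 = CPU1, 4 = CPU2, 8 = CPU3), so each interface's queues above end up pinned to a different core, assuming a four-core Pi. Reading a file back confirms the setting took:

cat /sys/class/net/eth0/queues/rx-0/rps_cpus    # should now show the mask written above, e.g. 1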