My former employer, @OCCRP, just went live with their new website, and it's pretty slick!
https://www.occrp.org/en
This is bitter-sweet for me.
On one hand, glad to see them have a new site, finally! The old one was a mess.
OTOH: I had designed and built the infra that hosted their site through Panama Papers (arguably OCCRP's big break). It did not rely on external CDNs or "DDoS-protection" providers.
That infra is no longer in use as of today. Replaced by Google.
For those curious, the infra was mostly:
- a pair of back-end servers (the main site was an ancient Joomla install…), in a production / warm standby configuration;
- a couple dozen very thin VPSes acting as (micro-)caching reverse proxies; we called them "fasadas" (from the Bosnian word for "façade");
- a bunch of scripts that tied it all together.
The stripped down and simplified nginx config for the fasadas lives as a FLOSS project here:
https://0xacab.org/rysiek/fasada
The production / warm standby back-end servers were automagically synced every hour. Yes, including the database.
This meant that:
1. we had a close-to-production testing server always available;
2. we had a way of quickly switching to an almost completely up-to-date backup back-end server in case anything went down with the production one.
The set-up on these back-ends included *two* nginx instances running in parallel on different ports, but with the same config, serving the same content.
Yes, on each.
Each fasada (i.e. reverse proxy on the edge) was configured to use *both* of these nginx instances on the currently-production back-end server.
Because everything was in docker containers, we could upgrade each nginx instance separately.
Whenever we were deploying nginx config changes or were upgrading nginx itself, we would do that one instance at a time. If it got b0rked, fasadas would just stop using the b0rked back-end nginx instance and switch to the other one.
No downtime. No stress.
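For the curious, on the fasada side this was just a standard nginx upstream with both back-end instances in it. A minimal sketch of the idea (made-up IPs and ports, not the actual OCCRP config):

```
# two nginx instances on the same back-end host, on different ports;
# if one is down or mid-upgrade, requests fail over to the other
upstream backend {
    server 192.0.2.10:8080;   # back-end nginx instance #1
    server 192.0.2.10:8081;   # back-end nginx instance #2
}

server {
    server_name example.com;
    # (listen/TLS directives stripped for brevity)

    location / {
        proxy_pass http://backend;
        # retry the other instance on connection errors and 5xx responses
        proxy_next_upstream error timeout http_502 http_503 http_504;
    }
}
```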
IP addresses of active fasadas (that is, ones that were supposed to handle production traffic) were simply added as A records for `occrp.org`.
This was Good Enough™, as browsers were already smart about selecting an endpoint IP address and sticking to it across requests related to the same domain.
This also meant that if an active fasada went under for whatever reason, visitors would mostly not notice – their browsers would retry against one of the remaining IPs.
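In zone-file terms the active set looked more or less like this (illustrative only, made-up IPs; the 900-second TTL comes up further down the thread):

```
occrp.org.  900  IN  A  192.0.2.11
occrp.org.  900  IN  A  192.0.2.12
occrp.org.  900  IN  A  192.0.2.13
occrp.org.  900  IN  A  192.0.2.14
```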
We had about 2 dozen fasadas configured, deployed, and ready to serve production traffic at any given time.
But we only kept 4-6 actually active for `occrp.org` (and some others for other sites we hosted).
The other ones were an "emergency stash".
If an active fasada did go under, we'd swap its IP address out of occrp.org A records, and add one of the currently healthy standbys instead.
If we started getting way more traffic than the current active fasada set could handle, we'd add more.
From my experience, what brings a site down is really rarely an *actual* DDoS. Most of the time it's an organic traffic spike hitting a slow back-end.
Hence:
1. microcaching
2. my exasperation with CloudFlare calling everything a DDoS
But I digress!
We did get honest-to-Dog DDoSes, some pretty substantial. When that happened we just… swapped out *all* active fasadas.
DDoS would happily continue against the 4 to 6 old IP addresses… While new visitors would get served from other nodes.
See, when you're DDoSing someone, you don't want to waste your bandwidth on checking DNS records, now do you? You want to put everything you've got into these malicious packets.
And when you do, and the target just moves on to a different set of IP addresses, you're DDoSing something that does not matter. Have at it!
Now, I am not saying *all* DDoSes work this way.
I *am* saying that all the DDoSes I have seen against OCCRP's infra when I was there worked this way.
The time we really went down hard was when our dedi provider (which was otherwise great!) overeagerly blackholed DDoS traffic…
…also blackholing our production back-end server.
Took us 45min to deal with this, mainly because I was out at lunch and for *once* I did not take my phone with me. While a certain @smari happened to be on vacation literally on the other side of the globe.
Dealing with this meant pushing a quick config change to the fasadas to switch to the warm spare back-end.
What a blast from the past!
I should probably write this all up in a blogpost, with some more lessons-learned (for example: remember to microcache your 4xx/5xx errors and 3xx redirects as well).
Thanks for joining me for this ride down memory lane!
I will now take your questions.
/end
@rysiek what's the point of having redundant web servers if they are both on the same machine? Thanks in advance for the answer.
@gubi getting there
@rysiek what order of magnitude were the TTLs on those records, how soon would clients notice the changes?
@viq TTL was 900 seconds, so ~30min (roughly 2×TTL) for ~full propagation.
@rysiek I worked my way back up the thread because I read "microcache" as "microfiche" ... ahem!
Interesting stuff even if you stuck to current-century tech, thank you!
@DamonHD haha, thanks!
@rysiek Q: what TTL did you typically have on your A records?
@DamonHD 900 seconds, meaning in ~30min (about 2×TTL) we could expect ~full propagation.
@rysiek OK, thanks!
Many, many years ago, when the BBC was hosting live UK General Election results for the first time, I think they used 5 minutes, and it broke a lot of things.
(Ofc everything is faster and better tuned these days, from the DNS servers to the browsers...)
@robryk @DamonHD for some visitors it would be instantaneous, if their recursive resolvers had not cached the occrp.org A records yet.
For those whose resolvers had, the worst-case scenario is roughly 2×TTL if the request happens *just* before we push DNS changes.
There are nuances and caveats, but that's an effective enough way of thinking about it.
@robryk @DamonHD there are all sorts of small random delays that can push it over the edge and mean that a recursive resolver still serves the cached response even though technically the TTL should have *just* expired.
Or, a recursive resolver gets a request from user A *just* before DNS changes are pushed, and caches that. Then user B issues a request *just* before the TTL expires and gets the cached response from the recursive resolver.
@rysiek Thanks for sharing! I don't think I've heard about micro caching before, can you share how low you set the time in this case?
@Herover it's all in the code.
Default: 10s for dynamic content
https://0xacab.org/rysiek/fasada/-/blob/master/services/etc/nginx/sites/example.com.conf?ref_type=heads#L43
And then depending on context:
https://0xacab.org/rysiek/fasada/-/blob/master/services/etc/nginx/sites/example.com.conf?ref_type=heads#L86
https://0xacab.org/rysiek/fasada/-/blob/master/services/etc/nginx/sites/example.com.conf?ref_type=heads#L106
Basically, the idea is that "dynamic" content (i.e. generated by the CMS on the back-end) is cached just long enough to dramatically limit the number of requests hitting the back-end, but not long enough to annoy the people who write/edit/modify that content in the CMS.
10s-20s is pretty good for that.
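For context, a stripped-down sketch of what microcaching can look like in nginx (assumed names and paths, not a copy of the fasada config linked above):

```
proxy_cache_path /var/cache/nginx/microcache keys_zone=microcache:10m
                 max_size=1g inactive=1h;

server {
    location / {
        proxy_pass http://backend;
        proxy_cache microcache;

        # cache dynamic (CMS-generated) responses very briefly: long enough
        # to absorb a traffic spike, short enough that editors see changes fast
        proxy_cache_valid 200 10s;

        # serve a stale copy rather than hammering a struggling back-end
        proxy_cache_use_stale error timeout updating
                              http_500 http_502 http_503 http_504;

        # collapse concurrent cache misses into a single back-end request
        proxy_cache_lock on;
    }
}
```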
@rysiek why do you recommend caching 4xx/5xx responses?
@slimhazard I recommend *microcaching* these responses just like all other dynamic responses, if the back-end is dynamic.
Why? Because if you're running a database-backed back-end, generating a 404 page or a 301 redirect might take close to as much time and resources as generating a 200 response.
And if your back-end is overwhelmed and throwing 504 Gateway Timeouts after a long wait, throwing more requests its way is also a bad idea.
@slimhazard learned that the hard way when Apple's retina macs became a thing.
On one of the sites we started getting "@2x" requests for *every* image:
https://www.kylejlarson.com/blog/creating-retina-images-for-your-website/
Those "@2x" images did not exist, so 404s were being returned. But these 404s were not being (micro)cached at the edge at the time, and that drove the load on the back-end through the roof.
So yeah, microcache your 3xx/4xx/5xx responses if you're doing microcaching at all.
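In nginx terms that's just a couple more `proxy_cache_valid` lines in the microcaching location (a sketch; the exact times are illustrative):

```
# redirects and "not found" pages can be as expensive for the CMS to
# generate as regular pages, so microcache them too
proxy_cache_valid 301 302 404 10s;

# when the back-end is already choking, don't pile more requests onto it;
# even a few seconds of caching the errors takes the pressure off
proxy_cache_valid 500 502 503 504 5s;
```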
@rysiek got it. I work with Varnish and see things through that lens, hence my question. We would do similar things, but differently.
In that world: 301 and 404 responses are cacheable by default, so the load for that can be taken from the backends in any case. And yes, a caching proxy has to react appropriately to distressed backends. For example with health checks, taking unhealthy backends out of rotation for a time to give them a chance to recover.
Sounds like your way worked.
@slimhazard yup, worked fine for us.
We looked at Varnish back when we were setting this up, but it did not support HTTPS (which was a necessity for us). The official way of adding HTTPS was "put nginx in front of it". So we shrugged and just used nginx.
@rysiek Yes. This should definitely be a blog post. I learned stuff!
@rysiek reminds me of when I was running a server for a popular shooting game, a great way to stop a ddos attack was to figure out which previous player was doing it (easy, they always bragged loudly about it), and then block the person's IP. The ddos would almost always stop within minutes because most people thought they had permanently ddos'd the server out of existence, somehow
@rysiek: I use parts of your NGINX configs at Peekr. Alongside using Redis, blocking AI scrapers' user agents server-side and optimising static resources, the instance works blazing fast and can return results in as little as a few milliseconds. Fastly (thanks to its Fast Forward programme) also helps us handle static resources such as CSS/JS/WOFF2 fonts (as an auxiliary caching method).
Also, can you please explain how the `$redirect_fbclid` map works? Would be interested in implementing this.
@slavistapl ah, nice!
Sure.
At some point Facebook started adding the `fbclid` argument, busting the cache.
So, first I use the map directive and a regex to set `$redirect_fbclid` to the request URL *without* the `fbclid` stuff.
https://0xacab.org/rysiek/fasada/-/blob/master/services/etc/nginx/nginx.conf?ref_type=heads#L189
And then, in each site's config, I do a 301 redirect to that `fbclid`-less URL, if that var is set:
https://0xacab.org/rysiek/fasada/-/blob/master/services/etc/nginx/sites/example.com.conf?ref_type=heads#L117
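Roughly, the idea is this (a simplified sketch, not the exact config linked above; among other things it ignores the edge case where `fbclid` is the first of several query arguments):

```
# in the http {} block: if the request URI contains an fbclid argument,
# set $redirect_fbclid to the same URI with that argument stripped
map $request_uri $redirect_fbclid {
    default "";
    "~^(?<prefix>.*)[?&]fbclid=[^&]*(?<suffix>.*)$"  "$prefix$suffix";
}

# in each site's server {} block: permanently redirect to the clean URL,
# so the cache only ever stores one copy of the page
if ($redirect_fbclid != "") {
    return 301 $redirect_fbclid;
}
```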
@rysiek: thank you! I'll implement this shortly + will test it out on other commercial social media platforms and report back with my findings.
@rysiek: so far, Facebook seems to be the only culprit. Haven't checked the messaging services such as Messenger/WhatsApp since I don't have an account on either.
Instagram and Twitter seem to include nothing alongside the destination link.