Camp COVID - A Recap
let me first say, on behalf of the Recon team, we cannot thank the community enough for joining us last week.
it was the first time we've ever run an event like this: 100% virtual, remote, and open to anyone and everyone.
it was a huge success, and we got incredible feedback. people were friendly, engaged, and having fun until the end.
throughout the day, we noticed a lot of you in the slack channels helping your fellow teammates, and even your opponents, when they got stuck in the game. it was pretty cool to watch the event bring everyone together like that.
GLOBAL participation: y'all, this is epic.
the fact that we were able to share this event internationally meant the world (hah, see what i did there) to us.
just look at all! these! colors! 😍 brian was able to take all of the IP data from CTFd and generate this map with MaxMind. check out his blog post that explains all of the nerdiness that went into generating it.
we ran 7 scenarios that made up the 385 challenges mentioned above.
our team worked really hard all week to ensure that they went off without a hitch. i always say that this part deserves a blog on its own, because it absolutely does.
so much research and development goes into the scenario dev side of OpenSOC that participants never get to see (for obvious reasons).
THERE IS NEVER ENOUGH TIME.
if you are unfamiliar with how our scenarios work, check out this breakdown of the oldest and longest running scenario, "Urgent IT Update!!!". get all the nerdy details (read: awesome details) from eric.
like every event we run, we inevitably have hiccups along the way. huge live environment, lots of moving parts, hundreds of people beating it up--things happen.
BUT. it makes us unbelievably happy to be able to say that this was the smoothest event we've ever had, and the largest.
graylog went "down" twice (down in quotes because not all nodes went down at the same time), but we now know exactly why.
and it was because of our old friend, http_thread_pool_size, that we ran into last year.
we had roughly the same player count as we did at DEF CON last year, but DEF CON was stretched across 3-4 days.
thursday, it was 650 people hammering on our systems, all within a 12 hour window. wildly different load.
and while we had scaled all the things up with regards to CPU/memory, i naively had only quadrupled this number, instead of what i should have done, which was multiply it by 16. the first time it happened, i bumped it to 128. that held for a brief period, but it didn't take long for it to get exhausted again.
bumped it to 256, and it was smooth sailing for the rest of the event. lesson learned. and we'll start scaling horizontally now that we have a much better baseline for that kind of load.
we could not have asked for better performance testing.
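for anyone who wants the thread pool numbers spelled out, here is a tiny sketch of the arithmetic. the baseline of 16 is an assumption inferred from the numbers above (16 x 16 = 256), not something pulled from our config, and the function names are made up for illustration.

```python
from typing import List

# assumption: baseline inferred from the post (16 * 16 = 256),
# not read from an actual graylog config file.
BASELINE_POOL_SIZE = 16


def scaled_pool_size(baseline: int, factor: int) -> int:
    """Return the thread pool size after scaling the baseline by factor."""
    return baseline * factor


def doubling_steps(start: int, needed: int) -> List[int]:
    """Double the pool size until it reaches the needed size; return every step."""
    steps = [start]
    while steps[-1] < needed:
        steps.append(steps[-1] * 2)
    return steps


quadrupled = scaled_pool_size(BASELINE_POOL_SIZE, 4)   # what was set going in
needed = scaled_pool_size(BASELINE_POOL_SIZE, 16)      # what ended up holding
bumps = doubling_steps(quadrupled, needed)

print(quadrupled, needed, bumps)  # 64 256 [64, 128, 256]
```

the doubling here just mirrors what happened on the day (128, then 256). it is not a sizing formula.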
this one caught us by surprise at a weird time. it was happy all day, with a couple brief periods where the viewer service caused memory consumption to get pretty high when there were so many people in there, but it corrected itself, and all was well. i had already upped resources on this system, so i was less concerned.
what i was not prepared for was elasticsearch being exhausted. elasticsearch behind moloch is a separate system, and smaller than what we have sitting behind graylog, since graylog gets punished a lot more during these events.
typically, all of the scaling efforts we put towards elasticsearch are because of so much data being ingested, not the other way around. and, as i mentioned, usually graylog's elasticsearch feels the brunt of this. not moloch's.
but, with this kind of turnout, and with so many people doing so much awesome querying, elasticsearch was struggling to breathe, and at around 7:30 PM EDT, it took a nosedive, and so did moloch.
The es_rejected_execution_exception[bulk] is a bulk queue error. It occurs when the number of requests to the Elasticsearch cluster exceeds the bulk queue size (threadpool.bulk.queue_size). The bulk queue on each node can hold between 50 and 200 requests, depending on which Elasticsearch version you are using. When the queue is full, new requests are rejected.
so, TL;DR, we added more juice to elasticsearch behind moloch. problem solved.
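if the queue behavior in that excerpt is hard to picture, here is a toy model of it. this is a sketch of a bounded queue in general, not elasticsearch's actual implementation. the class name and the capacity of 50 (the low end of the quoted range) are assumptions for illustration.

```python
from collections import deque


class BoundedBulkQueue:
    """Toy model of a fixed-size request queue that rejects when full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending = deque()
        self.rejected = 0

    def submit(self, request: str) -> bool:
        """Queue the request, or count a rejection if the queue is full."""
        if len(self.pending) >= self.capacity:
            self.rejected += 1
            return False
        self.pending.append(request)
        return True

    def drain(self, n: int) -> int:
        """Simulate workers finishing up to n queued requests."""
        done = min(n, len(self.pending))
        for _ in range(done):
            self.pending.popleft()
        return done


# hypothetical burst: 60 requests arrive before any worker frees a slot
queue = BoundedBulkQueue(capacity=50)
results = [queue.submit(f"req-{i}") for i in range(60)]

print(results.count(True), queue.rejected)  # 50 10
```

more resources means the queue drains faster, so a burst is less likely to hit the limit. that is roughly what "more juice" bought us.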
this might have been the biggest surprise of all for me. and it was also my biggest fear going into this event.
up until a few months ago, we relied heavily on kolide as one of our DFIR systems in our environment(s). but, we had a lot of issues with it at DEF CON.
from DEF CON, excerpt:
a half a dozen users would query a single windows system, and the osquery agent on it would come to a halt. which meant no one could hunt on it. we crafted a fix to clean and restart those agents on a schedule, and that worked for the duration of the event(s). this is by no means a long term fix, but it worked well enough in this case.
so, to avoid this going forward, we built our own osquery frontend.
while queries to endpoints still get queued, they self-correct and we don't have to restart services, or clear out any files, or kill any processes. we simply wait for it to finish the requests in the queue and carry on.
this only became an issue when the same system was getting hammered by everyone due to everyone simultaneously working through a particular scenario.
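to make the queueing idea concrete, here is a minimal sketch of a per-endpoint queue. this is not our frontend's code. the class, method names, and hostnames are all hypothetical, and it only shows the general shape: queries for one host wait in line and drain in order, and other hosts are unaffected.

```python
from collections import defaultdict, deque


class EndpointQueryQueues:
    """Toy per-endpoint FIFO: queries for a host wait in line and drain in order."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def submit(self, host, user, sql):
        """Queue a query for a host and return its position in line (0 = next up)."""
        self.queues[host].append((user, sql))
        return len(self.queues[host]) - 1

    def run_next(self, host):
        """Pop and return the oldest queued query for a host, or None if idle."""
        if self.queues[host]:
            return self.queues[host].popleft()
        return None


frontend = EndpointQueryQueues()
sql = "SELECT name, pid FROM processes;"

# hypothetical hosts and users: three players pile onto the same windows box
positions = [frontend.submit("win10-01", user, sql) for user in ("ana", "bo", "cy")]
other = frontend.submit("win10-02", "dee", sql)

print(positions, other)                    # [0, 1, 2] 0
print(frontend.run_next("win10-01")[0])    # ana
```

nothing gets restarted or cleaned up here. a busy host just means a longer line for that host.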
caveat: i've never built an application for this many people, or built any application to be used at scale. to reiterate, putting this in front of 600 people to do their worst was terrifying. i had no idea what to expect, but it held up and i was stoked.
now we just need to build in some more bumpers for folks still struggling with the query language. and osquery in general. and for the turds doing this kind of thing:
some people just want to watch the world burn.
we have had so many scoreboard woes over the years. platforms that scaled, but basic features were broken, or imports/exports were unreliable. platforms that had all the things, but didn't scale.
we decided we wanted no more of that.
we finally caved and got a subscription with CTFd to host our scoreboards going forward. their team has been super responsive anytime we've had an issue, and last thursday went ALL DAY without a single "THE SCOREBOARD IS DOWN!!!"
it was beautiful. so, thank you, CTFd team. nailed it.
thank you, again
we hope you all had as much fun as we did (hopefully more). we love running OpenSOC--it is really a labor of love for this team. and it truly takes the entire team, especially for an event this large.
we thrive on giving back to a community that has provided us with so much of what we use and rely on.