Hello, everyone. Today I'm going to talk about building a real user monitoring system with open source tools. Before I dive in, a bit more about me: my name is Tsvetan Stoychev and I work on mPulse at Akamai. mPulse is a commercial real user monitoring system, and it serves, I think, thousands of Akamai customers. My hobby project is Basicrum, which will be the focal point of this presentation. I would also like to share a bit about some of my other personal activities. Every December I make an attempt to publish at least one blog post on the Web Performance Calendar, which is the best place to see what the web performance community has been up to during the year. And sometimes I do street art; that's my safety net plan, so if ChatGPT takes over the world I will still have something creative to do.

Now let's move on to the important part of the presentation and take a look at how a real user monitoring system looks in general. We need something in the browser, ideally a JavaScript agent, that reads some data and sends it to a server. We store it somewhere, and later we analyze this data. Here we see the most trivial piece of JavaScript, the bare minimum that will do the job in the browser: it reads the current page URL, creates a one-by-one-pixel image and appends it to the HTML, which forces the browser to make a request to our endpoint. And here is a very simple code snippet for the server side, showing how we intercept this data and store it: the browser hits this route, we read the query parameters and the headers, we even put a timestamp into the structure, then we save it as JSON on the file system and return a transparent GIF back to the browser. Eventually, at the next stage, when we want to analyze the data, we go through all the files and create a summary of the page visits. In this example we can see that "category four" was the most visited page, with 427 page visits.

So that's the theory. In 2019 I started Basicrum as a hobby, and these are the components I used to build the initial version. On the browser side I used an open source library called boomerang.js, which collects a bunch of interesting metrics from the browser and sends them to a server. On the server side I used nginx and some PHP application code. For storage I used MySQL, and for analyzing the data I again used PHP to read the data and serve it to a frontend, where I used Plotly.js for the visualizations. I ended up with something like this. Interestingly, after five years it is still running, so if you want to give the first version of Basicrum a try, you can visit demo.basicrum.com and play with the UI.

Now, about boomerang.js. Boomerang.js was started in 2011 at Yahoo by Philip Tellis, who now happens to be a colleague of mine, and the library is currently maintained by the mPulse engineering team at Akamai. As I mentioned, the library collects a bunch of interesting metrics, including the Core Web Vitals ones: LCP, CLS and FID. It can also track some session data.
It can also help users of the library create a timeline of all the clicks on the page, the life cycle of a visitor, and it supports the more modern JavaScript APIs for sending the data to the server: fetch, XHR and sendBeacon. It can be found on GitHub at akamai/boomerang.

On the backend side, the earlier picture was again very theoretical. What was actually happening is that I was still saving every request that reached my server to a file. Then I was periodically running a cron job, which I marked on the slide as too much overhead (you will understand why later), that read all these collected files, created one big batch and inserted the data into MySQL. I also ended up with a database model that is very biased: my previous background was building Magento online shops, and anyone who has worked with Magento will probably recognize some of the patterns here, all these foreign key relationships and the main table at the center of everything. I had to put a bunch of indexes here, and again this created too much overhead, both at the application level and for me as a maintainer. Every time I wanted to introduce a new dimension I had to create a new table and add a bit more code for inserting the data; it was just too much maintenance for me. I also had to take care not to duplicate some of the data, and that is because of the nature of PHP: PHP is stateless, so every request is independent from the others and I couldn't keep things in memory. If I could have kept some references in memory, I probably could have optimized some things here.

And a question to the audience: do you have an idea what this query would produce? What's the idea behind it? Bucketing? Yes, it's bucketing for a histogram (I sketch an example of this kind of query below). I had to write a lot of these data-scientist-style queries, which introduced a bit of a learning curve for me, but the system really had such queries coded into it. This one produces a histogram of the time to first byte: we can see that the median is around 1.8 seconds and the distribution is a bit skewed. With the help of Plotly, the JavaScript visualization library, I could create panels like these for the distributions of operating systems and mobile operating systems, and bar charts showing the relationship between time to first byte and start render time. Plotly really is a cool, rich library and you can create a lot of panels with it.

But I found myself having difficulties and probably not focusing on the right things. As I said, when you build a real user monitoring system you need to change your mindset, and your queries need to be written more in a data scientist style. The ORM I was using in PHP, Doctrine, is not really meant for writing complex queries of this kind, so I found myself writing my own query builder, using Doctrine when convenient and my query builder when convenient, and this was again too much maintenance for a single maintainer of a project.
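To make the bucketing idea concrete, here is a rough sketch of that kind of histogram query (not the exact one from the slide). The table and column names (page_views, ttfb in milliseconds) and the 100 ms bucket size are assumptions for illustration; the original ran from PHP against MySQL, and the SQL is wrapped in Go here only to keep the code examples in one language.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver (the old Basicrum stack used MySQL)
)

func main() {
	// Hypothetical DSN and schema: a page_views table with a ttfb column in milliseconds.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/basicrum")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bucket TTFB into 100 ms bins and count how many page views fall into each bin.
	// The result is the raw data behind a histogram like the one on the slide.
	rows, err := db.Query(`
		SELECT (ttfb DIV 100) * 100 AS bucket_ms, COUNT(*) AS hits
		FROM page_views
		GROUP BY bucket_ms
		ORDER BY bucket_ms`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var bucketMs, hits int
		if err := rows.Scan(&bucketMs, &hits); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%d-%d ms: %d page views\n", bucketMs, bucketMs+100, hits)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```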
I also wanted to introduce user management and a permission system, but again, with my limited time, working on the project from time to time during the weekends, this was just too much and not the right focus; I simply wanted to show some meaningful data. And I really love Plotly, but I just ended up with large blobs of JavaScript here and there, and it was becoming more and more Plotly. I wanted to see data, not to write JavaScript.

So I took a break, I believe for half a year, and focused on my main job, but from time to time I did some research, read articles about time series databases and started exploring some of the available open source systems for visualization. And I rebuilt the complete backend. I kept boomerang.js, but I rewrote the server side: I completely removed nginx and PHP and used Go, I replaced MySQL with ClickHouse, and I replaced all the custom code, the PHP and the Plotly, with Grafana. If you would like to play with the current version of Basicrum, this is what I ended up with; it is essentially a slightly rebranded version of Grafana with the Basicrum-specific dashboards and settings. Just visit this address and use "calendar" as both the username and the password.

So where was Go really useful? Go is just a different paradigm, a different idea compared to PHP. With Go you can compile a single binary, and everything I needed was packaged inside that binary, so it's just one process that you run on the server and it has everything inside. This allowed me to get rid of nginx, because Go has a built-in package for an HTTP server. Yes, there are HTTP server packages for PHP as well, but you need a lot of workarounds to make them work, because this is just not native to PHP. I could also leverage the existing ClickHouse package for Go to interact with the ClickHouse database, and I took advantage of asynchronous inserts, which let me get rid of some of the code I had in the PHP version of Basicrum. It was also very easy to create a backup mechanism for all the data flowing through the system, because in Go I can easily keep things in memory: I didn't have to write each request to a file and later batch and bundle it. I just keep the incoming data points in memory for, say, 10 minutes, then flush them to the hard drive and compress them; that is really just a few lines of code that come naturally with Go (see the sketch below). For the cases where I needed encryption, there is a Let's Encrypt package for Go, a third-party package, with which I could easily spin up a server, say that I want to use Let's Encrypt, and get a secure connection; it really reduced the effort on the operations side. I also took advantage of a GeoIP lookup library that uses the MaxMind database. Why did I need this? In a real user monitoring system you want to see from which city or country a visitor came to the website; this is really helpful when you want to create a report and figure out in which countries your website is slow.
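To make the in-memory buffering and compressed flush idea concrete, here is a minimal sketch that uses only the Go standard library. The endpoint path, the Beacon structure, the file naming and the fixed flush interval are assumptions for illustration; this is not Basicrum's actual code.

```go
package main

import (
	"compress/gzip"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"sync"
	"time"
)

// Beacon is a hypothetical, simplified representation of one incoming RUM request.
type Beacon struct {
	ReceivedAt time.Time           `json:"received_at"`
	Query      map[string][]string `json:"query"`
	UserAgent  string              `json:"user_agent"`
}

var (
	mu     sync.Mutex
	buffer []Beacon // beacons kept in memory between flushes
)

// collect handles the beacon endpoint and keeps the request data in memory.
// A real collector would reply with a transparent GIF or forward the data onwards.
func collect(w http.ResponseWriter, r *http.Request) {
	b := Beacon{
		ReceivedAt: time.Now().UTC(),
		Query:      r.URL.Query(),
		UserAgent:  r.UserAgent(),
	}
	mu.Lock()
	buffer = append(buffer, b)
	mu.Unlock()
	w.WriteHeader(http.StatusNoContent)
}

// flush writes the buffered beacons to a gzip-compressed JSON-lines file and empties the buffer.
func flush() error {
	mu.Lock()
	batch := buffer
	buffer = nil
	mu.Unlock()
	if len(batch) == 0 {
		return nil
	}
	f, err := os.Create(fmt.Sprintf("backup-%d.json.gz", time.Now().Unix()))
	if err != nil {
		return err
	}
	defer f.Close()
	zw := gzip.NewWriter(f)
	enc := json.NewEncoder(zw)
	for _, b := range batch {
		if err := enc.Encode(b); err != nil {
			zw.Close()
			return err
		}
	}
	return zw.Close()
}

func main() {
	// Flush the in-memory buffer to disk every 10 minutes, as described above.
	go func() {
		for range time.Tick(10 * time.Minute) {
			if err := flush(); err != nil {
				log.Println("flush failed:", err)
			}
		}
	}()
	http.HandleFunc("/beacon", collect)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```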
I also took advantage of another library, for user agent parsing, which helped me extract important information like the browser name, the operating system and the user agent family. And I started using my new favorite database: ClickHouse.

Remember where I said I was doing a lot of work batching and bundling everything and inserting those big batches into MySQL? ClickHouse comes with a really cool feature called asynchronous inserts. It allows me, every time a request reaches my backend, to immediately issue an insert to ClickHouse; ClickHouse batches the data internally and decides when it actually needs to write it to the database, which helped me avoid performance bottlenecks. Another thing I could do with ClickHouse: you can see I had seven tables in the old MySQL setup, but in ClickHouse I ended up with two tables, and I could actually have had just one; I only needed the second table for showing the host names in the Grafana filters. With ClickHouse, or with time series in general, the main idea is different. In the old schema the data is normalized; I had really tried to build a user monitoring system in the fashion of a web shop, which is the wrong idea. With a time series database you just throw your data into one large, fat table, and you don't really need to worry about duplication of the data. For example, here we have the "device type" filter: I don't have a foreign key to another table where I keep references to all the device types; I can just insert the same string over and over again, desktop, desktop, desktop, and the database is completely fine with it. It compresses the data internally, and I don't experience any performance bottlenecks when I filter by this field. And here is my other favorite feature in ClickHouse: the LowCardinality data type. It is really convenient for columns where the number of distinct values is less than about 10,000, because ClickHouse optimizes it internally and WHERE conditions and filters on such columns become much, much faster. If we have more than 10,000 distinct values, we probably need to go back to something like the earlier schema and start introducing additional dimension tables.

What you see here on the left is, I would say, insane. I don't even know how I created it; I'm still surprised with myself. We cannot zoom in here, but this was a process that involved querying my MySQL database, some application code and a bunch of cron jobs, all trying to guess and find out which sessions bounced and what the duration of the sessions was. It was just really complex. With my new ClickHouse setup, to calculate the bounce rate I can use a query like this one. I got a bit of help with this query and I don't completely understand it, but it works, it's much simpler, and it makes Basicrum much, much easier to maintain. With this query I could easily create this correlation between the bounce rate and a metric, in our case time to first byte.
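Just to give a flavor of this kind of query, here is a rough sketch of one way a bounce rate can be computed in ClickHouse, treating a session with a single page view as a bounce. This is not the query from the slide; the table and column names (rum_events, session_id) are hypothetical, and it assumes the clickhouse-go v2 client.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2" // ClickHouse client for Go
)

func main() {
	// Connection details are placeholders for a local ClickHouse instance.
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"127.0.0.1:9000"},
		Auth: clickhouse.Auth{Database: "default", Username: "default"},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// One wide, denormalized table; a session with exactly one page view counts as a bounce.
	query := `
		SELECT countIf(views = 1) / count() AS bounce_rate
		FROM (
			SELECT session_id, count() AS views
			FROM rum_events
			GROUP BY session_id
		)`

	var bounceRate float64
	if err := conn.QueryRow(context.Background(), query).Scan(&bounceRate); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("bounce rate: %.1f%%\n", bounceRate*100)
}
```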
I also want to say that open source is not only about how great the product you work with is; the community is very important too, and that's another reason I stick with ClickHouse. They have a really great Slack community, and every time I ask a question I get a good response within a matter of a few hours. For example, here I'm asking: "Hey, I wrote this query but I feel it's not optimal, I'm not a SQL expert", and another expert actually suggested a better way to write the query; it's shorter and much more performant. Also, this is probably the first and the last database YouTube channel I will ever be subscribed to, but I am actually subscribed to the ClickHouse YouTube channel, and they have really good videos. Every month there is a release party video where the ClickHouse team shows the new features, and there are a lot of good tutorials, so it's really welcoming for beginners: as I said, you get support from the community and there are really good materials out there.

Now let's look at the user interface: Grafana. Earlier I mentioned that in the first version of Basicrum I was about to start implementing my own user management, login and authentication. With Grafana this comes out of the box, so it's much easier to add new users and give them different permissions, and again, this is code I would never want to write myself. In this repository I bundle the Basicrum version of Grafana, which has some customizations. Another benefit of Grafana is that it's very easy to model the data and decide what you want to see in the visualization panels. For example, here we can define filters, we can preview our data and we can configure different things; here I'm just showing how I can configure different colors for the different thresholds. There is also an SQL editor: when I write the SQL here, Grafana uses it to fetch the data from ClickHouse. Here are other panels I took advantage of. The world map was literally plug and play: I just configured a few things and told it where to read the data about the countries from. Grafana also has a third-party plugin for Plotly, so for the scenarios where I wanted to build some more complex panels I could still use it; with this plugin I built the panel showing how the screen width is distributed. Time series is the default view in Grafana, and I can also present the data in a table, which is very good when you want to explore your own data. Grafana also comes with different data sources, and of course Grafana needs to know how to talk to ClickHouse: in Basicrum I'm using a data source developed by a company called Altinity, but there is also another one, developed officially by ClickHouse.

And I want to say that all these dashboards that are built into the Basicrum version of Grafana are under version control. It's not that I created a dashboard in a Grafana instance, exported it and saved it somewhere; I have a repository with the configuration for each of the panels that I maintain. This makes it much easier when I need to change something or add a new panel, and I can go through the history and understand what actually changed if something has to be reverted. For example, here you can see how I keep one of these rows as
templated SQL, and this is how it's presented when we look at it in Grafana. From all this source code and configuration that I keep for the dashboards I build a Docker image. There is a bit of branding work in it, removing or rewriting some things from the default Grafana image, then we install the plugins that we need for our setup, and we import all the configurations for the dashboards and the data sources.

What I found over time, when I spoke to different people who asked me about real user monitoring systems, is that the conversation very often ended when I started explaining: you need to run this component on this server, and that component on that server, and this other component on yet another server. The use case of the people I spoke to usually didn't require them to scale; they had pretty small websites or web shops. So I'm working on something a bit more monolithic called Basicrum All-in-One. The idea, which probably sounds like bad practice from an engineering point of view, is to run everything on one big box, and it can actually be a really practical thing. I believe it could be hosted somewhere for about 20 euro a month, and I tested it: it can handle 1.5 million page views a month. Here we introduce Traefik, a proxy that sits in front of the other components and helps me with SSL termination and request routing, because some requests need to go to the data collection part and other requests need to go to Grafana, the part where we analyze the data. It's really convenient and easy if you just want to give it a try.

And a few takeaways. I have to say that a real user monitoring system is a fairly complex system, and if you want to develop one you need to train yourself: you need to learn more about the data collection side and how the data is collected from the browser, about how to visualize the data, and it will be a bonus if you learn how time series databases work. Again, choosing the right database to solve the right problem is the key, and it's great when you can move a problem from the application layer to the database layer; it just saves a lot of time. Grafana can also save a lot of time and effort. Even if you still want to build your own frontend, maybe just start with Grafana to play with the data and display something; it will literally save you a lot of time. I got a signal that I've run out of time, but you can catch me afterwards.

All right, I can take one question. So, about personal data: in this project we don't really keep any IP addresses, which I guess is what we would consider user data here. The backend doesn't store any personal data in this case; by default it uses the IP address only to identify the country and the city, and it doesn't store the IP address after that. On the data collection side, I know that the boomerang library (part of the boomerang source code is private) has special parts, for PCI compliance reasons, that try to avoid collecting sensitive things around the user. Sometimes the user may type in, for example, a credit card number, and this could actually be collected by mistake, so the library also tries to avoid collecting critical user information. Do you mean cookie consent? So, the library comes
with a special loader snippet, and you can have your own callback, so you can call this loader snippet only after a cookie consent. So it's possible.