The name of my talk is Postgres Observability. My intention is to show you what's great about Postgres and how it integrates well with observability, but also where some of the problems are. Obviously, in 25 minutes, it's not going to be an exhaustive presentation of all of the metrics in Postgres, but maybe I can give a bit of an introduction.

First of all, my name is Gregory Stark. I work for Aiven, in the open source programs office, contributing to Postgres. Aiven is a data infrastructure hosting company. We host Postgres, but we also host a range of other data services, including some observability tools. It's all open source software, and we contribute back to the projects that we sell.

I'm sure most people in this room have seen the cliched three pillars of observability. In modern software, people expect their logs to be structured so they can send them to some sort of indexed, aggregated log system, something like OpenSearch. They expect a time series database to hold all their metrics, with labels and well-defined semantics. They expect distributed tracing. Postgres is not really modern in that sense. It's still actively developed and has modern relational database features, but for things like this, Postgres is going on almost 30 years now. Our logs, our metrics, and our tracing tools predate most of these modern distributed-systems concepts.

So here is what these look like in Postgres. We have very good logs, but they're meant for a human to be reading in a text file. We do actually support JSON logs, but the actual error message, the actual log message, will just be a string inside that JSON struct. All the structured information in the JSON, the labels and so on, is metadata about the log line: things like the process ID and session ID. The current user is actually one of those columns, but if the error message mentions a user name or a table name or an index, that's just going to be part of the string.

There are tons of metrics in Postgres, and I'll go into more detail; I'm mainly going to be talking about metrics here. But they're in SQL; they're not in Prometheus exposition format or OpenMetrics or anything like that. And then there are explain plans, which are basically a tracing tool, but they're meant for a human investigating a single system; they don't integrate into any sort of distributed tracing tools.
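As a rough illustration of that last point, here is a minimal sketch of an explain plan being produced by hand; the table and query are hypothetical, used only for illustration:

    -- Run a query with timing instrumentation and read the plan yourself.
    -- "orders" is a hypothetical table.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM orders WHERE created_at > now() - interval '1 day';
    -- The output is a human-readable text tree (row counts, timings, buffer hits);
    -- nothing in it carries a trace ID or reports to a distributed tracing backend.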
So I want to spend a little bit of time showing you what the metrics in Postgres look like. I can't show you all of them, there are hundreds and hundreds, probably thousands, but I want to give you a feel for the kinds of in-depth metrics that Postgres does provide. There's a whole component inside Postgres whose job is to track metrics about your objects: your tables, indexes, functions, things like that. Those are mostly quantitative metrics, cumulative counters that count how many times events have occurred, or how many seconds have elapsed while doing operations on your table.

There are also other kinds of metrics that don't map so well to quantitative Prometheus-style metrics, and if I have time, I'll try to show a bit of why those are difficult to map to time series databases like Prometheus.

The thing to understand is that Postgres exposes these things through SQL. The way you access these metrics is by logging into the database and running SQL queries. So for example, this is pg_stat_database. I realize you probably can't read it very well, but if you can see the general shape of it: there's one line for each database inside the Postgres cluster. So there's a database called postgres, there's a database called template1, a database called template0, and another database with my username, stark. Each row of this table, which is actually a view, there's no storage attached to it, it's a dynamically generated table, a virtual table, say, represents the metrics for that database (I said table a moment ago; I meant database).

So it shows you the number of backends connected to that database; that's a gauge in Prometheus parlance, it can go up and down. The number of transactions that have committed on that database, I think that's since the database started up, actually. The number of transactions that have rolled back. The number of blocks that have been read on that database, and the number of blocks that were hit in the shared memory cache. And actually this is truncated; there are a good number more columns as well. But the key point is that there's a row for each database and a bunch of metrics about that database.
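A minimal sketch of querying that view yourself; the column list is abridged, but these columns do exist in pg_stat_database:

    -- One row per database; the counters are cumulative until the statistics are reset.
    SELECT datname,        -- database name
           numbackends,    -- backends currently connected (a gauge: it goes up and down)
           xact_commit,    -- transactions committed
           xact_rollback,  -- transactions rolled back
           blks_read,      -- blocks read from disk
           blks_hit        -- blocks found already in the shared memory cache
    FROM pg_stat_database;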
And then you can go into more detail. There are similar views to show you metrics about your tables: the number of sequential scans that have occurred on a table named pgbench_branches in this case, and pgbench_accounts, pgbench_tellers. Each row is a table, and it shows the counts of these various operations, like sequential scans, tuples read, and index scans, for each of those tables.

In the Prometheus world, or any other time series database, you would probably want to make the relation name, the table name here, a label on your metric. You probably also want the schema name as a label. You might want the ID number, which is that first column, as a label. You actually have a decision to make there: do you want the time series to be tied to the ID number or the name? So if you rename a table, is that a new time series or not?

In the Postgres world, that mapping, those decisions, have to be made somewhere, and where they get made is in an agent that connects to the database, runs SQL, and exposes the data in Prometheus exposition format or OpenMetrics. The standard agent for Prometheus is called postgres_exporter, and it has built-in queries for these things. It has built-in ideas about what the right labels are for the metrics and how to map the data types; these are actually all 8-byte integers, which need to be mapped to floating-point numbers for Prometheus. So there are all kinds of hidden assumptions that postgres_exporter has to make to map this data to the monitoring data, the data for Prometheus or M3 or whatever time series database you're using.
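To make that concrete, here is a hedged sketch of the kind of query an agent like postgres_exporter runs against this view; the exact query and label choices vary by exporter version and configuration:

    -- One row per table in pg_stat_user_tables.
    -- relid is the table's OID and survives a rename; relname does not.
    SELECT relid,         -- candidate label: stable across renames
           schemaname,    -- candidate label
           relname,       -- candidate label: a rename would start a new time series
           seq_scan,      -- sequential scans started on this table
           seq_tup_read,  -- tuples read by those sequential scans
           idx_scan,      -- index scans started on this table
           n_tup_ins, n_tup_upd, n_tup_del   -- rows inserted / updated / deleted
    FROM pg_stat_user_tables;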
I don't have time to go into how you would use these particular metrics to tune your database, but one point is that the way these metrics were originally designed, you were imagined to have a DBA logging into the database, querying specific rows with a WHERE clause, maybe doing calculations where you divide one column by another to find out how many tuples each sequential scan is returning, things like that. Obviously, in a modern observability world, what you're actually going to do, what postgres_exporter actually does, is just SELECT * with no WHERE clause, take all this data, and dump it into a time series database. Then you do those same calculations, but you do them in PromQL, or whatever the equivalent is in your observability tool. That gives you the same kind of flexibility, but now you can look at how those metrics relate to metrics that came from other sources, so you get a more global view: you can aggregate across multiple databases, you can aggregate across your Postgres databases and other systems. So a lot of the flexibility these views were designed to give you is no longer relevant when you're just doing a simple SELECT * and dumping it all into Prometheus.

Then there are more complicated metrics which don't really map well to tools like Prometheus or M3 or Datadog or whatever. This is pg_stat_activity; there's one row for each session. There are actually two result sets on the slide: the first shows the first dozen or so columns, and in the second I've elided the columns after pid and shown the next bunch of columns, because I wanted to make a point about one of those columns that would otherwise be way past the edge of the screen.

So in pg_stat_activity you have one row per session on the database, and obviously that's already difficult to put into Prometheus, because you would have time series come and go every time an application connects and disconnects. I think what postgres_exporter actually exports is aggregates: it just puts in a count of how many rows are present, and then maybe the minimum and maximum of some of these columns. But there is data in here like wait_event_type and wait_event; those are text strings. Inside Postgres those are actually ID numbers, but they get presented to the user in a nice readable format, which, if you want to make metrics out of them, you then probably turn back into numbers, or you put them in labels. They're difficult to really make use of in a time series database.
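As an example, here is a hedged sketch of the sort of aggregation an exporter typically does over pg_stat_activity, collapsing the per-session rows into counts keyed by those text columns:

    -- Count sessions by state and wait event instead of exporting one time series per session.
    SELECT state,
           wait_event_type,
           wait_event,
           count(*) AS sessions
    FROM pg_stat_activity
    GROUP BY state, wait_event_type, wait_event;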
Some of them are quite important to have some visibility into, though. There's information in pg_stat_activity that will show you if a session is in a transaction and idle, and you really do want to know if there's a session that has been idle in transaction for a long period of time. So what most people do there is export an aggregate: one gauge for the maximum, the longest time that any session has been idle in transaction.

Just to be clear about what we're talking about here: postgres_exporter connects to PostgreSQL and queries pg_stat_user_tables, pg_stat_user_indexes, pg_stat_activity, all the various views that start with pg_stat. postgres_exporter is also very flexible: you can configure custom queries against other views, or against pg_stat views where you want more detail than the default queries give you.

It doesn't actually include all those per-table statistics by default. If you have an application where your schema is fairly static and you have a reasonable number of tables, you can quite reasonably collect all of those columns, put them in Prometheus, and do all kinds of nice graphs and visualizations, but that's not standard. If, on the other hand, you're an ISP with hundreds of customers, and your customers create and drop tables outside your control, then you can't really be gathering statistics like that, because you'd be taking on unbounded cardinality, with time series coming and going without you being able to control it. So the level of detail that you grab is very dependent on how you're using Postgres: whether you're a site with one key database that you want to optimize, or many, many databases that you just want to monitor at a high level; an application that you control versus applications that you're hosting for other people.

It also means that many sites add custom queries to postgres_exporter for other data sources. What I've put in this diagram is pg_stat_statements, which is an extension for Postgres that gathers statistics about your queries. The key in there is a query ID, which is a hash of the query with the constants removed, and you can get long-lived statistics about which queries are taking a lot of time or doing a lot of I/O. But that is, again, a custom query that you would be adding.

So I've talked a bit about the difficulty of mapping some of these metrics into a time series database.
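For reference, here is a sketch of the two kinds of custom query just described: the idle-in-transaction gauge and a pg_stat_statements query. These are typical examples rather than postgres_exporter's exact built-in queries, and pg_stat_statements has to be installed for the second one.

    -- Longest time any session has spent idle in a transaction, exported as a single gauge.
    SELECT coalesce(max(extract(epoch FROM now() - state_change)), 0) AS max_idle_in_txn_seconds
    FROM pg_stat_activity
    WHERE state = 'idle in transaction';

    -- Top queries by total execution time; queryid is the hash of the query with constants removed.
    -- (The column is total_time rather than total_exec_time on versions before PostgreSQL 13.)
    SELECT queryid, calls, total_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;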
There are other problems as well. How am I doing for time? OK, so I do want to talk a bit about the kinds of problems that we have.

Some of the metrics don't map very well to Prometheus metrics. And the fact that the metrics can be customized, and in fact kind of have to be customized, because Postgres is used in different ways at different sites, means that while there is a standard dashboard in Grafana for Postgres, it's a very high-level dashboard. I do have a screenshot there, yeah. There is a dashboard for Postgres, but it's not showing individual tables and individual functions and so on, because on many sites that data wouldn't even be present; you have to add custom queries for it.

It also means you have to deploy the agent. You have to run this sidecar, this Go program, alongside your database everywhere you deploy your database. Or, depending on how you deploy it, you can deploy a single one for all your databases, or one for all the databases running on one host. So the mapping of which agent's metrics correspond to which actual database is entirely dependent on how you manage your deployments.

I can't go into all of the problems, but I gave names to each of these classes of problem. The resource contention problems are that, because Postgres exposes this information through SQL, you have to have a working SQL session in order to get the metrics. So when your system is not functioning correctly, you're very likely to also lose the data you need to debug the problem. If you're running low on connections, or you're running into transaction wraparound, or the system is just out of memory, or getting disk errors, quite often you also lose all the metrics that would let you figure out which application component is using all the connections, or which table it is that needs to be vacuumed to recover from the transaction wraparound issue.

I've actually run into a problem where a table was locked by the application, and the custom queries needed that same lock. So the queries all disappeared, the metrics all disappeared, because postgres_exporter was getting blocked on that lock. When I tried to recreate it for a demo, I found something different: this wasn't a lock at all, I had actually caused the Postgres regression tests to fail, because one of the regression tests tries to drop a database. postgres_exporter keeps a connection to each database, because, like I said, you need a session, a connection to the database, to get the metrics, and in Postgres each session is tied to a specific database. So if you have a dozen databases, it uses a dozen connections, and it keeps those connections open. That's optional, and it's there to work around the problem that it might not be able to connect when you're already in trouble, but as a result it has persistent connections to those databases, and the regression test failed when it tried to drop that database. And that could actually happen in production: if you do a deploy that rolls out a new version of some data by dropping a database and recreating it from scratch, and you have postgres_exporter running with a connection open, you can run into the same kind of issue.
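A sketch of that failure mode, with a hypothetical database name; this is roughly what the drop looks like while an exporter holds its connection:

    -- The exporter keeps an idle session connected to database "appdb" (hypothetical name).
    -- Any other session that tries to drop it fails:
    DROP DATABASE appdb;
    -- ERROR:  database "appdb" is being accessed by other users
    -- DETAIL:  There is 1 other session using the database.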
So I'm already working on something to replace postgres_exporter with a background worker inside Postgres, so you would be connecting directly to Postgres and wouldn't have to deploy a separate program alongside it. My goal is that it would have standardized metrics that every dashboard, visualization, or alerting tool could rely on, so we could have mixins with alert rules and visualizations built on standardized metrics that will always be present. And the metrics would be exported directly from shared memory, without going through the whole SQL infrastructure, so it would avoid depending on locks and transactions and all of the things that could interfere with, or be interfered with by, the application.

It's still early days. I have a little proof of concept, but it's not going to be in the next version of Postgres; it's definitely experimental. The main difficulties are going to be definitional problems. For example, the table names, like I mentioned before: should a time series change when a table gets renamed? In fact, I have a bigger problem there, because the table names are in the catalog, the schema catalog; they're not in shared memory, and we don't really want them in shared memory, because that brings in the whole issue of character encodings and collations. So it will probably only replace the core database metrics, and you would still deploy a tool like postgres_exporter for your custom queries, for more application-level metrics, not for monitoring core Postgres metrics. My hope is that when you deploy Postgres, you can add it to your targets in Prometheus and not have to do any further operational work to get dashboards and alerts.

Moderator: Two more minutes.

It feels like time is elastic here. So, I skipped over some things, but this is the proof of concept. The telemetry server in the first ps listing there is a single process.
It's a Postgres background worker; you can connect to it and get metrics, with just ID numbers for the tables. The second example is postgres_exporter, and you can see there's a database session; with postgres_exporter there's a database session for each database, and they're all idle. So even just reducing the number of sessions and the number of processes involved is already quite a visible improvement.

I have more information if people have questions or want to see something specific, but I tried to condense a much longer presentation into 25 minutes, so I've skipped over plenty of other material. If there are questions, that would probably be better than me just jumping around finding a slide.

Moderator: Thanks a lot for the great talk, it was pretty interesting. So, any questions, anyone?

Q: Hello, my name is Brian. You spoke about metrics; are there any traces, or any talk of traces in the future?

A: I have ideas, I have plans, but they're all in my head; there's no code. Postgres does have explain plans, and explain plans are basically traces, but what we have today is that you run something on the terminal and you see the plan for your query, and there's an extension that will dump the explain plans into the logs. It's a bit pie in the sky, but I don't see any reason we shouldn't be exporting that same information to a tracing server. That basically just involves adding support for receiving the trace IDs and the spans, and creating spans for either plan nodes or certain kinds of plan nodes. These are not well-thought-out plans yet. My pie-in-the-sky dream is to be able to answer the question: which front-end web API endpoint is causing sequential scans on this table over here, skipping the whole stack in the middle, without having to dig all the way up.
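The extension mentioned there is, I believe, auto_explain, which ships with Postgres and logs plans for slow statements. A minimal sketch of enabling it for a session (it can also be preloaded server-wide via shared_preload_libraries):

    -- Load auto_explain and log the plan of any statement slower than 250 ms.
    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '250ms';
    SET auto_explain.log_analyze = on;   -- include actual row counts and timings
    -- Plans then show up in the server log as text, still without any trace or span IDs attached.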
Q: We have an architecture in which we have Postgres databases which are short-lived, running in Docker containers, so the entire cluster will basically live and die in a matter of possibly minutes or less. And we would like to know what the hell is going on with them. Have you got any bright ideas?

A: I admit I don't think I've seen anybody trying to do that with Prometheus. I mean, it's not a best practice in Prometheus to have time series that keep changing, but you're kind of inevitably going to get a new bunch of time series with each database. I guess I'd need a better idea of what you're looking for. I don't think I have anything off the top of my head that you wouldn't have already thought about.

Q: Hi, where can we get your proof of concept from, to fiddle with it and test it?

A: I'm sorry, I didn't hear the question.

Q: Where can we get your proof of concept from, to test it and fiddle with it?

A: I posted a patch to the mailing list. Postgres follows a fairly old-school patch review process where patches are mailed to the hackers mailing list, so it's easy to lose sight of patches once they get posted, and this was months ago. I can send it to you if you want, and you can probably find it on the mailing list if you search. It's pretty early days, though; it's not really ready to use even for experimental production use.

Q: With the integrated metrics, how do you expose them? Is there an HTTP endpoint exposed directly from Postgres?

A: The current situation is that it's a background worker, and that background worker has a configuration option to specify a second port to listen on. It runs a very small embedded web server, so it responds to normal HTTPS requests. I would want the normal Postgres port to respond, so that your target is just the database port. But I've actually already heard a lot of pushback on that idea. A lot of Postgres installs are sort of old-school, where you probably have the database firewalled, and you don't want a new service running on the same port as the actual database; you want a port that you can firewall separately for your admin stack.

On the other hand, it makes Prometheus very difficult to manage when you have a different port to get metrics from. You have the database running on port A and the metrics on port B, and you have to have your dashboards and targets and so on all configured to understand that the target on port B is actually the database on port A. You can add rewrite rules, but then you have to manage those rewrite rules. But I don't really expect people to accept the idea of responding on the database port. There's also a general security principle involved: it's almost always a terrible idea, for security reasons, to respond to two different protocols on the same port, because a lot of security vulnerabilities have come about from bugs where one side of a connection thinks you're talking protocol A and the other side thinks you're talking protocol B. So there are big trade-offs to doing that.
Q: First of all, thanks a lot for the amazing talk, very insightful, and thanks for offering to modernize Postgres monitoring. You had a very good point there about standardizing the metrics. I've been involved in the semantic conventions around OpenTelemetry and other projects, but in general I'm curious to hear, from you personally or Aiven or anyone else, what kind of effort is being done to standardize database monitoring metrics, not specifically Postgres but databases in general, if you can share?

A: I would be interested in that, but I haven't heard anything on that front. That would be exciting, and it would be a lot of work; I think a lot of the interesting metrics would be difficult to standardize. I don't know, I haven't seen anything like that.

Moderator: Okay, so thanks a lot, everyone.