{"id":3455,"date":"2025-04-22T12:46:31","date_gmt":"2025-04-22T12:46:31","guid":{"rendered":"https:\/\/kedar.nitty-witty.com\/blog\/?p=3455"},"modified":"2025-04-22T12:58:54","modified_gmt":"2025-04-22T12:58:54","slug":"how-to-resolve-disk-space-issues-in-pmm-case-study","status":"publish","type":"post","link":"https:\/\/kedar.nitty-witty.com\/blog\/how-to-resolve-disk-space-issues-in-pmm-case-study","title":{"rendered":"How to Resolve Disk Space Issues in PMM: Case Study"},"content":{"rendered":"\n<p>Recently, I encountered a Percona Monitoring and Management (PMM) server that was rapidly approaching complete disk exhaustion. This post outlines the steps taken to identify the issue and reclaim disk space.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ df -h\nFilesystem Size Used Avail Use% Mounted on\n\/dev\/sda1 523G 508G 16G 98% \/<\/code><\/pre>\n\n\n\n<p>With only 16GB remaining on a 523GB drive, this PMM installation was on the brink of failure. If you&#8217;re experiencing similar issues with your PMM deployment, this guide details how to identify and resolve PMM disk space issues.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>BTW did you know, <a href=\"https:\/\/docs.percona.com\/percona-monitoring-and-management\/3\/release-notes\/3.0.0.html\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">PMM 3.0<\/a> is already out?<\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">Identifying the Culprit<\/h2>\n\n\n\n<p>After inspecting common culprits like system logs and temporary files, the main disk usage was traced to <code>\/var\/local\/percona\/pmm\/srv\/<\/code>, a mount used by the PMM container. Within it, the ClickHouse database specific folder was the primary space consumer.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">ClickHouse in PMM<\/h2>\n\n\n\n<p>Probably you already know but if not, question is, what&#8217;s ClickHouse is housing in PMM&#8217;s architecture that is utilizing so much disk? <\/p>\n\n\n\n<p>ClickHouse is an open-source, column-oriented database management system that excels at real-time analytics on large datasets. Within the PMM ecosystem, <strong>ClickHouse<\/strong> facilitates the <strong>Query Analytics<\/strong> functionality. 
## ClickHouse in PMM

You may already know this, but if not, the question is: what is ClickHouse storing in PMM's architecture that consumes so much disk?

ClickHouse is an open-source, column-oriented database management system that excels at real-time analytics on large datasets. Within the PMM ecosystem, **ClickHouse** backs the **Query Analytics (QAN)** functionality, storing the query metrics collected from the monitored instances. This explains why the ClickHouse database can grow significantly over time, especially in environments monitoring numerous database instances with heavy query loads.

## Investigation: Tracking Down the Biggest Tables

To identify which specific tables were responsible for the excessive disk usage, I connected to the PMM server container:

```
# Connect to the PMM server container
podman exec -it pmm-server /bin/bash

# or, if you're using Docker
docker exec -it pmm-server /bin/bash
```

Once inside the container, I ran the following query against ClickHouse to identify the largest tables:

```
$ clickhouse-client
```

```sql
SELECT database, table, SUM(bytes_on_disk) AS size_on_disk, COUNT() AS parts_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY size_on_disk DESC;
```

```
┌─database─┬─table───────────────────┬─size_on_disk─┬─parts_count─┐
│ system   │ trace_log               │ 346710891455 │          48 │
│ system   │ asynchronous_metric_log │   4216804400 │          37 │
│ system   │ metric_log_0            │   4096890831 │          84 │
│ system   │ metric_log              │   3300187902 │          35 │
...
```

The `system.trace_log` table was consuming a whopping 346GB! This table stores trace information for ClickHouse queries and operations.
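As a side note, the same aggregation is easier to eyeball with ClickHouse's built-in `formatReadableSize()` function:

```sql
-- Same query, with sizes rendered human-readable
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    count() AS parts_count
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 10;
```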
## Drilling Down: Analyzing Partitions

To understand the issue with the `trace_log` table, I reviewed its partition details:

```sql
SELECT database, table, partition, sum(rows) AS rows, sum(bytes_on_disk) AS size_on_disk
FROM system.parts
WHERE active = 1 AND table = 'trace_log'
GROUP BY database, table, partition
ORDER BY partition ASC;
```

The output of the above query exposed the problem:

```
┌─database─┬─table─────┬─partition─┬───────rows─┬─size_on_disk─┐
│ system   │ trace_log │ 202405    │   19230736 │    353844712 │
│ system   │ trace_log │ 202406    │ 2972432060 │  48032655534 │
│ system   │ trace_log │ 202407    │ 3500442984 │  56447376224 │
│ system   │ trace_log │ 202408    │ 3381851250 │  53964707527 │
│ system   │ trace_log │ 202409    │ 3138145807 │  50178088805 │
│ system   │ trace_log │ 202410    │ 3307448245 │  52953410205 │
│ system   │ trace_log │ 202411    │ 3184102845 │  51206343208 │
│ system   │ trace_log │ 202412    │ 2093835550 │  33552956631 │
│ system   │ trace_log │ 202501    │    1183332 │     21518446 │
└──────────┴───────────┴───────────┴────────────┴──────────────┘
```

The table contained partitions dating back over eight months! This historical data was consuming massive amounts of disk space while providing little value for current monitoring needs.
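Before deleting anything, it's worth estimating how much space a cleanup will actually return. A minimal sketch, assuming you keep only the most recent months (the `202412` cutoff is illustrative; pick your own):

```sql
-- Estimate reclaimable space in partitions older than a chosen cutoff.
-- partition is a String in system.parts, and YYYYMM values compare correctly as text.
SELECT formatReadableSize(sum(bytes_on_disk)) AS reclaimable
FROM system.parts
WHERE active AND database = 'system' AND table = 'trace_log'
  AND partition < '202412';
```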
## Looking at the Table Structure

```sql
SHOW CREATE TABLE system.trace_log;
```

The output confirmed that the table was partitioned by month but had no TTL (Time To Live) policy to automatically purge old data:

```sql
CREATE TABLE system.trace_log
(
    `event_date` Date,
    `event_time` DateTime,
    ...
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_time)
SETTINGS index_granularity = 8192
```

## The Solution: A Two-Part Approach

### 1. Implementing Automatic Data Retention with TTL

The first step was to add a TTL policy to automatically remove data older than 30 days:

```
# Edit the ClickHouse configuration
vi /etc/clickhouse-server/config.xml
```

I added the following TTL definition to the section for the `trace_log` table:

```xml
<ttl>event_date + INTERVAL 30 DAY DELETE</ttl>
```
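For context, here is roughly what the surrounding `<trace_log>` block in `config.xml` looks like once the TTL line is in place. The other elements shown are typical ClickHouse defaults for this section, not values taken from this server:

```xml
<trace_log>
    <database>system</database>
    <table>trace_log</table>
    <partition_by>toYYYYMM(event_date)</partition_by>
    <!-- the line added in this case study -->
    <ttl>event_date + INTERVAL 30 DAY DELETE</ttl>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</trace_log>
```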
Then I restarted ClickHouse to apply the change:

```
supervisorctl restart clickhouse
```

After the restart, I verified the change was applied by checking the table definition again:

```sql
SHOW CREATE TABLE system.trace_log;
```

The updated definition now included the TTL clause:

```
[root@pmm-host-server opt]# clickhouse-client
ClickHouse client version 23.8.2.7 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 23.8.2 revision 54465.

pmm-host-server :) SHOW CREATE TABLE system.trace_log;

SHOW CREATE TABLE system.trace_log

Query id: 4a2f87cb-0030-4751-b17d-08aa84f38abb

┌─statement──────────────────────────────────────────────────────────┐
│ CREATE TABLE system.trace_log
(
    `event_date` Date,
    `event_time` DateTime,
    `event_time_microseconds` DateTime64(6),
    `timestamp_ns` UInt64,
    `revision` UInt32,
    `trace_type` Enum8('Real' = 0, 'CPU' = 1, 'Memory' = 2, 'MemorySample' = 3, 'MemoryPeak' = 4, 'ProfileEvent' = 5),
    `thread_id` UInt64,
    `query_id` String,
    `trace` Array(UInt64),
    `size` Int64,
    `ptr` UInt64,
    `event` LowCardinality(String),
    `increment` Int64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_time)
TTL event_date + toIntervalDay(30)
SETTINGS index_granularity = 8192 │
└────────────────────────────────────────────────────────────────────┘
```

### 2. Manually Removing Historical Data

The TTL only takes care of data going forward, so I decided to manually remove the existing historical data to reclaim disk space immediately, and began dropping old partitions. (Note the table name: when ClickHouse restarts with a changed definition for a system log table, it renames the existing table — here to `trace_log_1` — and creates a fresh `trace_log`, so the drops target the renamed table that holds the historical data.)

```sql
ALTER TABLE system.trace_log_1 DROP PARTITION '202405';
ALTER TABLE system.trace_log_1 DROP PARTITION '202406';
```

But it wasn't straightforward. I hit a roadblock in the form of the `max_[table/partition]_size_to_drop` setting: since the partition being dropped was larger than 50GB, the operation was aborted with an error that also listed the possible solutions:

```
pmm-host-server :) ALTER TABLE system.trace_log_1 DROP PARTITION '202407';

ALTER TABLE system.trace_log_1
    DROP PARTITION '202407'

Query id: f2b93063-57df-444f-9199-82c7695f39b9


0 rows in set. Elapsed: 0.003 sec.

Received exception from server (version 23.8.2):
Code: 359. DB::Exception: Received from localhost:9000. DB::Exception: Table or Partition in system.trace_log_1 was not dropped.
Reason:
1. Size (56.45 GB) is greater than max_[table/partition]_size_to_drop (50.00 GB)
2. File '/srv/clickhouse/flags/force_drop_table' intended to force DROP doesn't exist
How to fix this:
1. Either increase (or set to zero) max_[table/partition]_size_to_drop in server config
2. Either create forcing file /srv/clickhouse/flags/force_drop_table and make sure that ClickHouse has write permission for it.
Example:
sudo touch '/srv/clickhouse/flags/force_drop_table' && sudo chmod 666 '/srv/clickhouse/flags/force_drop_table'. (TABLE_SIZE_EXCEEDS_MAX_DROP_SIZE_LIMIT)

pmm-host-server :)
```

**ClickHouse has a safety mechanism that prevents dropping large tables or partitions (>50GB by default) without explicit confirmation.**
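The error message lists two ways out. The config-side option, if you would rather not juggle flag files, is to raise or zero out the limits in `config.xml` (shown as a sketch; `0` disables the check entirely):

```xml
<!-- In /etc/clickhouse-server/config.xml: 0 removes the size cap on DROP -->
<max_table_size_to_drop>0</max_table_size_to_drop>
<max_partition_size_to_drop>0</max_partition_size_to_drop>
```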
To work around this, I went with the second suggestion and created the force-drop flag:

```
touch '/srv/clickhouse/flags/force_drop_table' && chmod 666 '/srv/clickhouse/flags/force_drop_table'
```

With this flag in place, it was possible to drop the larger partitions:

```sql
ALTER TABLE system.trace_log_1 DROP PARTITION '202407';
```

Note that the flag is automatically removed after each operation, so it has to be recreated before dropping each large partition.
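Since the flag is consumed per drop, a small loop saves the repetition. A minimal sketch, assuming the partition list comes from the earlier `system.parts` query:

```bash
# Drop each oversized historical partition, recreating the force-drop flag
# before every ALTER because ClickHouse deletes the flag after each use.
for p in 202406 202407 202408 202409 202410 202411 202412; do
  touch /srv/clickhouse/flags/force_drop_table
  chmod 666 /srv/clickhouse/flags/force_drop_table
  clickhouse-client --query "ALTER TABLE system.trace_log_1 DROP PARTITION '$p'"
done
```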
## Results and Long-term Prevention

After implementing this two-part solution, I successfully reduced the disk usage and established a sustainable automatic cleanup process. The TTL configuration ensures that trace logs older than 30 days are purged automatically, preventing future disk space issues.

The Percona team is aware of this issue and is working on better default TTL policies for system tables in future PMM releases. A bug report ([PMM-13644](https://perconadev.atlassian.net/browse/PMM-13644)) has been filed to address it permanently.

Until those improvements land in a future PMM version, the workaround described in this case study will help you manage ClickHouse disk usage effectively.

## Conclusion

Reclaiming disk space in PMM is crucial for maintaining optimal server performance and monitoring efficiency. As this case study shows, understanding PMM's internal components, such as the `trace_log` table in ClickHouse, helps identify the root cause of disk usage issues.

Have you faced similar issues with your PMM deployment? Share your strategies for managing disk usage in the comments below!