<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:hashnode="https://hashnode.com/rss"><channel><title><![CDATA[Moshe Nadler's blog]]></title><description><![CDATA[Moshe Nadler's blog]]></description><link>https://moshe0076.hashnode.dev</link><generator>RSS for Node</generator><lastBuildDate>Tue, 03 Dec 2024 08:08:01 GMT</lastBuildDate><atom:link href="https://moshe0076.hashnode.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><atom:link rel="next" href="https://moshe0076.hashnode.dev/rss.xml?page=2"/><atom:link rel="previous" href="https://moshe0076.hashnode.dev/rss.xml"/><item><title><![CDATA[Trivy - Shifting Security From Right to Left and then Right Again]]></title><description><![CDATA[Trivy - Shifting Security From Right to Left and then Right Again]]></description><link>https://moshe0076.hashnode.dev/trivy-shifting-security-from-right-to-left-and-then-right-again</link><guid isPermaLink="true">https://moshe0076.hashnode.dev/trivy-shifting-security-from-right-to-left-and-then-right-again</guid><category><![CDATA[Devops]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[CI/CD]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[Supply Chain Attack]]></category><category><![CDATA[Security]]></category><category><![CDATA[trivy]]></category><dc:creator><![CDATA[Moshe Nadler]]></dc:creator><pubDate>Fri, 02 Feb 2024 17:16:25 GMT</pubDate><content:encoded>&lt;![CDATA[&lt;h2 id=&quot;heading-introduction&quot;&gt;Introduction&lt;/h2&gt;&lt;p&gt;A fundamental principle that I have come to appreciate, after building CI/CD pipelines for some time, is the importance of identifying any potential issues at the earliest possible stage. 
This way, the team can spot and fix the issues before they leak into development or production environments. And guess what? This approach even has a name; it&apos;s known as &quot;&lt;a target=&quot;_blank&quot; href=&quot;https://www.aquasec.com/cloud-native-academy/devsecops/shift-left-devops/&quot;&gt;shift left&lt;/a&gt;&quot;.&lt;/p&gt;&lt;p&gt;Unit tests are the classic example of the &quot;shift left&quot; paradigm (&lt;strong&gt;Image 1&lt;/strong&gt;). Your &lt;strong&gt;Continuous Integration&lt;/strong&gt; (CI) system pulls the code from your favorite source control and runs a bunch of unit tests to make sure that the changes made to the code will not break the application. If any of the tests fail, the CI pipeline will break, and a notification will likely be sent so that the team is aware that the new version has issues. The end goal is to prevent a potentially buggy version from being deployed to the development environments, or worse, to production.&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1705908883579/a6b2f290-1d70-4045-9e24-cadd4abfa1fc.png&quot; alt=&quot;TrivyCICD1.jpg&quot; class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Image 1. CI/CD Flow of a Kubernetes application based on GitHub, Jenkins, Slack and ArgoCD&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;But can we embed vulnerability and misconfiguration scanning as part of the CI in the same way we did with unit testing? Owing to the relatively small size of a microservice code base and the compact nature of the base images used for running the code (think &lt;a target=&quot;_blank&quot; href=&quot;https://www.alpinelinux.org/&quot;&gt;Alpine Linux&lt;/a&gt;), the number of code dependencies and system packages is also modest.&lt;/p&gt;&lt;p&gt;This makes scanning both the base image and the code dependencies for vulnerabilities and misconfigurations feasible in a relatively short time. 
Therefore, vulnerability and misconfiguration scanning can be done as part of the CI flow. We just need the right tool for the job.&lt;/p&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://trivy.dev/&quot;&gt;Trivy&lt;/a&gt; was nicknamed &quot;The Swiss Army Knife for Security Scanning&quot;, and for a very good reason. It can run as a &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy/v0.48/&quot;&gt;CLI&lt;/a&gt; tool (great for our CI needs) but also as a Kubernetes &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.2/&quot;&gt;Operator&lt;/a&gt; for continuous scanning of the cluster from inside. The CLI part is really handy. It digs into different parts of our application and highlights any security issues we need to know about. Trivy can scan the file system (before we package everything inside an image), but it can also scan images (base images or our application image with all of our code inside). In general, Trivy can scan for,&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;OS package vulnerabilities&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Code dependency vulnerabilities&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Misconfigurations&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Secrets exposed in the code base&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Much more&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Going back to the diagram in &lt;strong&gt;Image 1&lt;/strong&gt;, we can enhance our CI/CD flow and add Trivy&apos;s abilities to our CI flow (&lt;strong&gt;Image 2&lt;/strong&gt;). 
If Trivy finds any vulnerabilities, misconfigurations, or any other security issue, the build will fail and the CI system will prevent a potentially insecure version from being deployed to any of our environments.&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1705909193043/2ade2b25-399e-4801-808e-df6a88c95550.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Image 2. CI/CD Flow of a Kubernetes application based on GitHub, Jenkins, Slack, ArgoCD with Trivy scanning for issues&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Vulnerabilities are very dynamic and new ones are found by the hour (more on that in the Shift Right section). Trivy handles this by using a &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/aquasecurity/trivy-db&quot;&gt;Database&lt;/a&gt; which it pulls for updates every 6 hours by default.&lt;/p&gt;&lt;h2 id=&quot;heading-shift-left-trivy-in-the-ci-flow&quot;&gt;Shift Left - Trivy in the CI Flow&lt;/h2&gt;&lt;p&gt;I will not delve into the details of how to install the Trivy CLI tool. There are many CI systems and CI services, and the installation of Trivy can differ from one to the next. The official installation documentation can be found &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy/v0.48/getting-started/installation/&quot;&gt;here&lt;/a&gt;. A number of examples of integrations with CI systems can be found &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy/v0.48/tutorials/integrations/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;h3 id=&quot;heading-when-to-hit-the-brakes&quot;&gt;When to Hit the Brakes?&lt;/h3&gt;&lt;p&gt;Trivy categorizes the severity of identified vulnerabilities and issues into several levels: &lt;code&gt;UNKNOWN, LOW, MEDIUM, HIGH, CRITICAL&lt;/code&gt;. 
One of the questions I find myself asking all the time is, &quot;When should I break the build and start searching for a fix?&quot;&lt;/p&gt;&lt;p&gt;In a well-updated environment, there may be some low and medium vulnerabilities or issues, fewer high-level vulnerabilities or issues, and almost no critical ones. Alerting on the low and medium vulnerabilities or issues can cause a lot of noise in the CI; this slows the CI cycle and impacts delivery (and frustrates the team). Alerting on high and critical vulnerabilities or issues should be much less noisy, without compromising too much on security (I hope).&lt;/p&gt;&lt;p&gt;So, usually I set Trivy to break the build (by sending exit code 1 to the runtime, &lt;code&gt;--exit-code 1&lt;/code&gt;) when there are high and/or critical vulnerabilities or issues (&lt;code&gt;--severity HIGH,CRITICAL&lt;/code&gt;). For example, when scanning a Dockerfile,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy config --exit-code 1 --severity HIGH,CRITICAL ./Dockerfile&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Another decision that needs to be made: what if Trivy spots a high or critical issue, but there&apos;s no fix for it right now? Should we stop everything and wait, or just keep going? We can instruct Trivy to ignore vulnerabilities which are found but for which no fix is available by setting the &lt;code&gt;--ignore-unfixed&lt;/code&gt; flag.&lt;/p&gt;&lt;p&gt;Personally, I am not in favor of using the &lt;code&gt;--ignore-unfixed&lt;/code&gt; flag by default. My preferred approach is to break the build even in the absence of a known fix for a high or critical vulnerability. Then, some research needs to be done and a decision needs to be made - is this vulnerability a high risk for us, or can we live with it for some time until a fix is found? 
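&lt;/p&gt;&lt;p&gt;When we do decide to live with a finding for a while, one home-grown way to time-box that decision is to keep each ignored CVE next to an expiry date and regenerate the ignore file on every build. This is a sketch of my own, not a built-in Trivy feature; the two-column file format and the &lt;code&gt;build_ignorefile&lt;/code&gt; helper are made up for illustration,&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Time-boxing an ignored CVE: each line of ignores.txt holds a CVE ID and an
# expiry date (ISO dates compare correctly as plain strings):
#   CVE-2019-0001 2020-01-01
# Only entries that have not expired yet end up in .trivyignore.
build_ignorefile() {
  local today
  today=$(date +%F)
  cat "$1" | while read -r cve expiry; do
    if [ -z "$cve" ]; then continue; fi
    # once today reaches or passes the expiry date, drop the entry so the
    # CVE shows up in scan results again
    if [[ "$today" > "$expiry" || "$today" == "$expiry" ]]; then
      continue
    fi
    echo "$cve"
  done
}

# sample input, created here so the sketch is self-contained
printf '%s\n' 'CVE-2019-0001 2020-01-01' 'CVE-2030-0001 2099-01-01' > ignores.txt
build_ignorefile ignores.txt > .trivyignore
cat .trivyignore
```

&lt;p&gt;The generated file is picked up automatically when named &lt;code&gt;.trivyignore&lt;/code&gt; in the scan directory, or it can be passed explicitly with Trivy&apos;s &lt;code&gt;--ignorefile&lt;/code&gt; flag.&lt;/p&gt;&lt;p&gt;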
If the vulnerability is not a big deal for us, the &lt;code&gt;--ignore-unfixed&lt;/code&gt; flag can be added temporarily (for a single build or for a defined period of time).&lt;/p&gt;&lt;p&gt;I use Trivy in three specific steps of the CI flow. The first two are executed before any code dependency installation or unit test execution. The last one is executed just before the CI pushes the new application image to the image repository.&lt;/p&gt;&lt;h3 id=&quot;heading-scanning-the-application-dockerfile-and-kubernetes-manifests-files&quot;&gt;Scanning the Application Dockerfile and Kubernetes Manifests Files&lt;/h3&gt;&lt;p&gt;The first step is to scan the Dockerfile which is used to create the application image and the Kubernetes manifests (either Helm or other). Trivy should find any misconfigurations that can result in an insecure application image or risky Kubernetes objects.&lt;/p&gt;&lt;p&gt;To understand why we should do this, let&apos;s take a look at a very simple example. Let&apos;s assume that someone made a change to the Dockerfile, resulting in an application image executing with the &lt;em&gt;root&lt;/em&gt; user. Our original Dockerfile had the non-privileged user &quot;my-user&quot; set,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-dockerfile&quot;&gt;USER my-user&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;But then, someone accidentally switched the directive to,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-dockerfile&quot;&gt;USER root&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Best practices &lt;a target=&quot;_blank&quot; href=&quot;https://sysdig.com/blog/dockerfile-best-practices/&quot;&gt;dictate&lt;/a&gt; that the application container should execute with a non-privileged user. So the change above is a bad practice, which can lead to a lot of trouble. 
Scanning the Dockerfile with Trivy,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy config --exit-code 1 --severity HIGH,CRITICAL ./Dockerfile&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Will result in,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;2024-01-23T12:17:19.686+0200    INFO    Misconfiguration scanning is enabled
2024-01-23T12:17:20.933+0200    INFO    Detected config files: 1

Dockerfile (dockerfile)

Tests: 19 (SUCCESSES: 18, FAILURES: 1, EXCEPTIONS: 0)
Failures: 1 (HIGH: 1, CRITICAL: 0)

HIGH: Last USER command in Dockerfile should not be &apos;root&apos;
Running containers with &apos;root&apos; user can lead to a container escape situation. It is a best practice to run containers as non-root users, which can be done by adding a &apos;USER&apos; statement to the Dockerfile.
See https://avd.aquasec.com/misconfig/ds002&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We can do the same with Kubernetes manifests. 
For example, running the following Trivy scan,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy config --exit-code 1 --severity HIGH,CRITICAL ./helm-charts&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Will result in,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;HIGH: Container &apos;my-app&apos; of Deployment &apos;my-deployment&apos; should set &apos;securityContext.readOnlyRootFilesystem&apos; to true
An immutable root file system prevents applications from writing to their local disk. This can limit intrusions, as attackers will not be able to tamper with the file system or write foreign executables to disk.
See https://avd.aquasec.com/misconfig/ksv014&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;So, if Trivy finds any issues with the Dockerfile or Kubernetes manifests, the build will break.&lt;/p&gt;&lt;h3 id=&quot;heading-scanning-the-application-dependencies&quot;&gt;Scanning the Application &lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/h3&gt;&lt;p&gt;The second step is to scan the application dependencies for vulnerabilities. Trivy supports a large number of &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy/v0.48/docs/coverage/language/#supported-languages&quot;&gt;programming languages&lt;/a&gt;. This is done by running,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy filesystem --scanners vuln --include-dev-deps --exit-code 1 --severity HIGH,CRITICAL .&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Note the &lt;code&gt;--include-dev-deps&lt;/code&gt; flag. 
By default, Trivy will not scan dev-dependencies (like dependencies specified in the &lt;code&gt;devDependencies&lt;/code&gt; section of an npm &lt;code&gt;package.json&lt;/code&gt; file). For me, this is an issue. Although these dependencies will not be part of the final image, they will get installed and used as part of the testing steps. Using dependencies with vulnerabilities on your CI system can lead to trouble. If Trivy flags any issues with any of the dependencies, the build will stop just in time.&lt;/p&gt;&lt;h3 id=&quot;heading-scanning-the-final-application-image&quot;&gt;Scanning the Final Application Image&lt;/h3&gt;&lt;p&gt;The third step is to scan the final application image. This happens after all the unit tests have been executed and the Dockerfile has been used to package the application inside a base image, resulting in an application image ready to be deployed.&lt;/p&gt;&lt;p&gt;This step will find issues and vulnerabilities in a number of potentially hidden spots,&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;The base image itself can have outdated OS packages with vulnerabilities which didn&apos;t exist when the base image was constructed and scanned&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The Dockerfile can install outdated OS packages into the base image, which will introduce OS package vulnerabilities in the final image&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The Dockerfile is capable of installing application dependencies that are beyond the scope of the code base itself, such as executing &lt;code&gt;npm install&lt;/code&gt; or &lt;code&gt;pip install&lt;/code&gt; commands within the Dockerfile. These external dependencies may introduce potential vulnerabilities.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Secrets and passwords can find their way into the final application image. 
This could happen in a few ways, like through the Dockerfile, getting mixed into the base image, or some other route.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The scan itself is done by running,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy image --exit-code 1 --severity HIGH,CRITICAL [IMAGE_NAME]&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If Trivy spots any issues with the final image, the build will stop right there, preventing the problematic application image from sneaking into the image repository and ending up in one of the Kubernetes clusters.&lt;/p&gt;&lt;h2 id=&quot;heading-shift-right-trivy-inside-the-cluster&quot;&gt;Shift Right - Trivy Inside the Cluster&lt;/h2&gt;&lt;p&gt;Unit tests are coupled to a code version. As long as no changes are introduced to the code, the results of the tests should remain the same. This means that when the tests pass and the code is deployed, we don&apos;t need to rerun the unit tests to ensure the stability of the code over time.&lt;/p&gt;&lt;p&gt;Vulnerabilities are different. An application image with no vulnerabilities today may have critical ones tomorrow. Because we usually do not replace &lt;strong&gt;all&lt;/strong&gt; of our running containers inside a Kubernetes cluster on a daily basis (and sometimes not even for weeks or months), the container images used may accumulate vulnerabilities over time that we are not aware of.&lt;/p&gt;&lt;p&gt;The same goes for configuration files. New configuration best practices are published based on issues and vulnerabilities discovered. 
So, a configuration file scan done today may miss some new best practices recommended tomorrow (I assume that you never make manual configuration changes, and everything goes via a CI/CD flow).&lt;/p&gt;&lt;p&gt;We can address these issues by scanning our image repositories and configuration files either on a daily basis or when new vulnerabilities are discovered (AWS Inspector can &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/inspector/latest/user/scanning-ecr.html&quot;&gt;do it for us&lt;/a&gt; for images stored in AWS ECR). Subsequently, we will need to identify the Kubernetes clusters where those images and configs were deployed and replace them with updated ones.&lt;/p&gt;&lt;p&gt;Tracing and automating this solution can be complex, particularly if there are numerous image repositories with many images added each day, and lots of Kubernetes clusters with different image versions deployed. Additionally, this approach requires us to keep track of all third-party images used in the clusters (such as ArgoCD, Prometheus, etc.), which we do not always keep in our own image repositories.&lt;/p&gt;&lt;p&gt;A different approach involves installing a component inside the cluster that scans the cluster internals both on a regular basis and each time a change is made (for example, when a Pod spec is changed). This controller should expose metrics and reports that we can utilize for alerts and monitoring. And guess what? &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.2/&quot;&gt;Trivy Operator&lt;/a&gt; does exactly this!&lt;/p&gt;&lt;h3 id=&quot;heading-the-trivy-operator&quot;&gt;The Trivy Operator&lt;/h3&gt;&lt;p&gt;I will not go into the details of how to deploy the Trivy Operator to the cluster. 
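&lt;/p&gt;&lt;p&gt;For reference, a typical Helm-based install is only a few commands. Treat this as a sketch: the chart repository URL, chart name, and namespace follow the operator&apos;s documentation at the time of writing, so verify them against the guide,&lt;/p&gt;

```shell
# Add the Aqua Security chart repository and install the operator
# into its own namespace (names here follow the official docs).
helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update
helm install trivy-operator aqua/trivy-operator \
  --namespace trivy-system \
  --create-namespace
```

&lt;p&gt;Keeping the operator in its own &lt;code&gt;trivy-system&lt;/code&gt; namespace separates its scan Jobs and reports from application workloads.&lt;/p&gt;&lt;p&gt;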
The official &quot;Getting Started&quot; &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.3/&quot;&gt;guide&lt;/a&gt; contains examples of a number of ways to install the operator.&lt;/p&gt;&lt;p&gt;The operator manages a &lt;a target=&quot;_blank&quot; href=&quot;https://kubernetes.io/docs/concepts/architecture/controller/&quot;&gt;controller&lt;/a&gt; which runs a control loop. The control loop is responsible for the generation and continuous update of &lt;em&gt;security report resources,&lt;/em&gt; stored as Kubernetes &lt;a target=&quot;_blank&quot; href=&quot;https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/&quot;&gt;custom resources&lt;/a&gt; which can be accessed via the Kubernetes API. The reports are generated or updated when,&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;A new resource is deployed or updated within the cluster (a new version of a Deployment, a new StatefulSet, a change in an RBAC role, etc.)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;On a recurring schedule. This is done by setting a TTL for the security reports (defaults to 24 hours). 
When the TTL expires, the reports are deleted, which causes the controller to rerun the scans and create updated reports.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The Trivy Operator generates a number of reports, some of which are,&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706609852502/9149d536-9146-4fd1-8413-109602ae8d5a.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;As you can see, it&apos;s truly a powerful tool that provides great visibility into the various components of the Kubernetes cluster, and enables continuous security monitoring and auditing of the cluster.&lt;/p&gt;&lt;h3 id=&quot;heading-running-the-scans-jobs&quot;&gt;Running the Scan Jobs&lt;/h3&gt;&lt;p&gt;The Trivy Operator utilizes Kubernetes &lt;a target=&quot;_blank&quot; href=&quot;https://kubernetes.io/docs/concepts/workloads/controllers/job/&quot;&gt;Jobs&lt;/a&gt; to execute the scan tasks. Each time it needs to run a scan task, it will create a Job Pod. The Job Pod runs a Trivy init-container, which pulls the Trivy Database and saves it to an &lt;a target=&quot;_blank&quot; href=&quot;https://kubernetes.io/docs/concepts/storage/volumes/#emptydir&quot;&gt;emptyDir&lt;/a&gt; volume shared by all the containers of the Pod.&lt;/p&gt;&lt;p&gt;Next, if, for example, the Job is tasked with scanning the containers of a newly deployed application Pod, it will pull the Pod&apos;s container images (think of a Deployment Pod with an init-container, an application container, and maybe a sidecar) into the Job Pod. Then it will run the Trivy CLI against each container, using the Trivy Database in the emptyDir volume. 
Finally, the resulting logs of each Trivy CLI run will be aggregated into a security report (&lt;strong&gt;Image 3&lt;/strong&gt;).&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706719090709/5fd75c18-4861-4531-b58e-cd1c2b03f138.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Image 3. Trivy Operator uses Kubernetes Jobs for scanning containers&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;I would like to offer you two tips. The first one is related to the Kubernetes nodes that the scan jobs run on; the second is related to isolated environments and/or large concurrent scan jobs.&lt;/p&gt;&lt;p&gt;At first, I tried to reuse nodes that already had workloads running on them, limiting the CPU and memory usage of the scan Job Pod as much as possible. This resulted in &lt;a target=&quot;_blank&quot; href=&quot;https://en.wikipedia.org/wiki/Out_of_memory&quot;&gt;OOM&lt;/a&gt; (Out of Memory) issues.&lt;/p&gt;&lt;p&gt;To overcome this, I decided to run the scan jobs on separate nodes. As I&apos;m using EKS and &lt;a target=&quot;_blank&quot; href=&quot;https://karpenter.sh/&quot;&gt;Karpenter&lt;/a&gt; to manage most of the cluster nodes, I have created a separate NodePool that is dedicated only to Trivy Operator Jobs. Then, I set the &lt;code&gt;scanJobNodeSelector&lt;/code&gt; in the Helm values file of the Trivy Operator to select only the nodes of this specific NodePool.&lt;/p&gt;&lt;p&gt;When no jobs are running, Karpenter keeps the size of the NodePool at zero (money saved!). Once there is a need for scan jobs, Karpenter will see the Pending Pods and will start as many nodes as needed (the NodePool is based on &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/ec2/spot/details/&quot;&gt;Spot Instances&lt;/a&gt; only, so money saved again!). Once all jobs are done, Karpenter will again scale the NodePool to zero. 
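&lt;/p&gt;&lt;p&gt;For reference, the relevant part of the Helm values might look roughly like this. Treat it as a sketch: the node label is a placeholder for whatever your dedicated NodePool sets, and the exact nesting of &lt;code&gt;scanJobNodeSelector&lt;/code&gt; should be checked against the chart&apos;s values file,&lt;/p&gt;

```yaml
# Helm values (sketch): send scan Job Pods only to the dedicated Karpenter NodePool
trivyOperator:
  scanJobNodeSelector:
    karpenter.sh/nodepool: trivy-scans
```

&lt;p&gt;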
This way, I can give the scan Job Pods more memory and CPU without impacting other running workloads in the cluster.&lt;/p&gt;&lt;p&gt;The second tip relates to situations where the scan jobs need to run on isolated nodes with no direct Internet access. By default, the scan jobs run in &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/latest/docs/vulnerability-scanning/trivy/#standalone&quot;&gt;Standalone&lt;/a&gt; mode, where each job pulls the Trivy Database from GitHub. If the scan job cannot reach GitHub, it cannot start. Moreover, even when an Internet connection is available, a large number of concurrent scan jobs launched at the same time can lead to rate limit issues with GitHub.&lt;/p&gt;&lt;p&gt;To address this issue, we may employ Trivy in its &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/latest/docs/vulnerability-scanning/trivy/#clientserver&quot;&gt;ClientServer&lt;/a&gt; mode. In this approach, we deploy another component of the Trivy Operator called the trivy-server. The trivy-server is responsible for updating the Trivy Database on a regular basis. As such, it&apos;s the only component requiring Internet access. Furthermore, the scanning job Pods will not scan the images themselves. Instead, they will send a reference to the image to the trivy-server, which then conducts the scan and relays the findings.&lt;/p&gt;&lt;h3 id=&quot;heading-monitoring-and-alerting&quot;&gt;Monitoring and Alerting&lt;/h3&gt;&lt;p&gt;The Trivy Operator exposes the results of its scanning in the form of &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.3/tutorials/integrations/metrics/&quot;&gt;Prometheus Metrics&lt;/a&gt;. This enables visibility into the state of cluster vulnerabilities and misconfigurations. 
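&lt;/p&gt;&lt;p&gt;As a sketch of what alerting on these metrics can look like, here is a minimal Prometheus rule that fires when a potential exposed secret shows up. The metric name and labels are assumptions on my side; verify them against the &lt;code&gt;/metrics&lt;/code&gt; endpoint of your operator version,&lt;/p&gt;

```yaml
# prometheus-rule.yaml (sketch): page when any exposed-secret finding exists
groups:
  - name: trivy-operator
    rules:
      - alert: ExposedSecretInCluster
        expr: sum(trivy_image_exposedsecret) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Trivy Operator found a potential exposed secret in a running image
```

&lt;p&gt;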
There is a pre-made &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.3/tutorials/grafana-dashboard/&quot;&gt;Grafana Dashboard&lt;/a&gt; that visualizes the metrics in a nice way (&lt;strong&gt;Image 4&lt;/strong&gt;).&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706775472709/9bafc319-19fe-4158-b70b-3e13461a9c60.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Image 4. A sample of the Trivy Operator Grafana Dashboard&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;The real power comes with the ability to set alerts based on the metrics. For example, we can set an alert that triggers each time the value of the &lt;code&gt;trivy_image_exposedsecret&lt;/code&gt; metric is greater than zero, notifying us when there is a potential exposed secret. Also, as the Operator re-scans the cluster on a daily basis, we can set alerts to trigger when there is an increase in the number of critical vulnerabilities compared to yesterday.&lt;/p&gt;&lt;p&gt;As you can see, we can craft a very effective vulnerability and misconfiguration monitoring system based on the metrics and deploy it to a number of clusters, which will provide a very good grip on security.&lt;/p&gt;&lt;p&gt;Unfortunately, as of now, the Trivy Operator lacks a frontend interface that we can use to see detailed information about the vulnerabilities and misconfigurations found. However, if you are using &lt;a target=&quot;_blank&quot; href=&quot;https://k8slens.dev/&quot;&gt;Lens&lt;/a&gt;, there is a Trivy Operator &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/aquasecurity/trivy-operator-lens-extension&quot;&gt;extension&lt;/a&gt; that you can add to Lens. 
The extension provides nice and detailed insight into the issues found and can help when investigating alerts based on the Prometheus metrics.&lt;/p&gt;&lt;h2 id=&quot;heading-to-sum-things-up&quot;&gt;To Sum Things Up&lt;/h2&gt;&lt;p&gt;Whether running as part of our CI/CD pipelines or as a Kubernetes Operator inside our clusters, Trivy is a great enabler in implementing DevSecOps and maintaining as secure an infrastructure as possible. Shifting left with Trivy will help prevent issues from reaching our clusters; shifting right again will help us detect and mitigate issues inside our clusters.&lt;/p&gt;&lt;p&gt;Have fun!&lt;/p&gt;]]&gt;</content:encoded><hashnode:content>&lt;![CDATA[&lt;h2 id=&quot;heading-introduction&quot;&gt;Introduction&lt;/h2&gt;&lt;p&gt;A fundamental principle that I have come to appreciate, after building CI/CD pipelines for some time, is the importance of identifying any potential issues at the earliest possible stage. This way, the team can spot and fix the issues before they leak into development or production environments. And guess what? This approach even has a name; it&apos;s known as &quot;&lt;a target=&quot;_blank&quot; href=&quot;https://www.aquasec.com/cloud-native-academy/devsecops/shift-left-devops/&quot;&gt;shift left&lt;/a&gt;&quot;.&lt;/p&gt;&lt;p&gt;Unit tests are the classic example of the &quot;shift left&quot; paradigm (&lt;strong&gt;Image 1&lt;/strong&gt;). Your &lt;strong&gt;Continuous Integration&lt;/strong&gt; (CI) system pulls the code from your favorite source control and runs a bunch of unit tests to make sure that the changes made to the code will not break the application. If any of the tests fail, the CI pipeline will break, and a notification will likely be sent so that the team is aware that the new version has issues. 
The end goal is to prevent a potential buggy version from being deployed to the development environments, or worse, to production.&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1705908883579/a6b2f290-1d70-4045-9e24-cadd4abfa1fc.png&quot; alt=&quot;TrivyCICD1.jpg&quot; class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Image 1. CI/CD Flow of a Kubernetes application based on GitHub, Jenkins, Slack and ArgoCD&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;But can we embed vulnerability and misconfiguration scanning as part of the CI in the same way we did with unit testing? Owing to the relatively small size of a microservice code base and the compact nature of the base images used for running the code (think &lt;a target=&quot;_blank&quot; href=&quot;https://www.alpinelinux.org/&quot;&gt;Alpine Linux&lt;/a&gt;), the amount of code dependencies and system packages is also modest.&lt;/p&gt;&lt;p&gt;This makes the scanning of both the base image and code dependencies for vulnerabilities and misconfigurations a feasible action in a relatively short time. Therefore, vulnerability and misconfiguration scanning can be done as part of the CI flow. We just need the right tool for the job.&lt;/p&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://trivy.dev/&quot;&gt;Trivy&lt;/a&gt; was nicknamed &quot;The Swiss Army Knife for Security Scanning&quot; and for a very good reason. It can run as a &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy/v0.48/&quot;&gt;CLI&lt;/a&gt; tool (great for our CI needs) but also as a Kubernetes &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.2/&quot;&gt;Operator&lt;/a&gt; for continuous scanning of the cluster from inside. The CLI part is really handy. It digs into different parts of our application and highlights any security issues we need to know about. 
Trivy can scan the file systems (before we package everything inside an image), but it can also scan images (base images or our application image with all of our code inside). In general, Trivy can scan for,&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;OS Packages vulnerabilities&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Code dependencies vulnerabilities&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Misconfigurations&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Secrets exposed in the code base&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Much more&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Going back to the diagram in &lt;strong&gt;Image 1&lt;/strong&gt;, we can enhance our CI/CD flow and add Trivy abilities to our CI flow (&lt;strong&gt;Image 2&lt;/strong&gt;). If Trivy finds any vulnerabilities, misconfigurations or any other security issue, the build will fail and the CI system will prevent a potential insecure version from being deployed to any of our environments.&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1705909193043/2ade2b25-399e-4801-808e-df6a88c95550.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Image 2. CI/CD Flow of a Kubernetes application based on GitHub, Jenkins, Slack, ArgoCD with Trivy scanning for issues&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Vulnerabilities are very dynamic and new ones are found by the hour (more on that in the Shift Right section). Trivy handles this by using a &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/aquasecurity/trivy-db&quot;&gt;Database&lt;/a&gt; which it pulls for updates every 6 hours by default.&lt;/p&gt;&lt;h2 id=&quot;heading-shift-left-trivy-in-the-ci-flow&quot;&gt;Shift Left - Trivy in the CI Flow&lt;/h2&gt;&lt;p&gt;I will not delve into the details of how to install the Trivy CLI tool. There are many CI systems and CI services, and the installation of Trivy can differ from one to the next. 
The official installation documentation can be found &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy/v0.48/getting-started/installation/&quot;&gt;here&lt;/a&gt;. A number of examples of integrations with CI systems can be found &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy/v0.48/tutorials/integrations/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;h3 id=&quot;heading-when-to-hit-the-brakes&quot;&gt;When to Hit the Brakes?&lt;/h3&gt;&lt;p&gt;Trivy categorizes the severity of identified vulnerabilities and issues into several levels: &lt;code&gt;UNKNOWN, LOW, MEDIUM, HIGH, CRITICAL&lt;/code&gt;. One of the questions I find myself asking all the time is, &quot;When should I break the build and start searching for a fix?&quot;.&lt;/p&gt;&lt;p&gt;In a well-updated environment, there may be some low and medium vulnerabilities or issues, fewer high-level ones, and almost no critical ones. Alerting on the low and medium findings can create a lot of noise in the CI; this will slow the CI cycle and impact delivery (and frustrate the team). Alerting on high and critical findings should be much less noisy, without compromising too much on security (I hope).&lt;/p&gt;&lt;p&gt;So, I usually set Trivy to break the build (by sending exit code 1 to the runtime, &lt;code&gt;--exit-code 1&lt;/code&gt;) when there are high and/or critical vulnerabilities or issues (&lt;code&gt;--severity HIGH,CRITICAL&lt;/code&gt;). For example, when scanning a Dockerfile,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy config --exit-code 1 --severity HIGH,CRITICAL ./Dockerfile&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Another decision that needs to be made: what if Trivy spots a high or critical issue, but there&apos;s no fix for it right now? Should we stop everything and wait, or just keep going? 
We can instruct Trivy to ignore vulnerabilities for which no fix is yet available by setting the &lt;code&gt;--ignore-unfixed&lt;/code&gt; flag.&lt;/p&gt;&lt;p&gt;Personally, I am not in favor of using the &lt;code&gt;--ignore-unfixed&lt;/code&gt; flag by default. My preferred approach is to break the build even in the absence of a known fix for a high or critical vulnerability. Then, some research is needed and a decision has to be made: is this vulnerability a high risk for us, or can we live with it for some time until a fix is available? If the vulnerability is not a big deal for us, the &lt;code&gt;--ignore-unfixed&lt;/code&gt; flag can be added for a limited period (even just for one build).&lt;/p&gt;&lt;p&gt;I use Trivy in three specific steps of the CI flow. The first two are executed before any code dependencies are installed or unit tests are run. The last one is executed just before the CI pushes the new application image to the image repository.&lt;/p&gt;&lt;h3 id=&quot;heading-scanning-the-application-dockerfile-and-kubernetes-manifests-files&quot;&gt;Scanning the Application Dockerfile and Kubernetes Manifest Files&lt;/h3&gt;&lt;p&gt;The first step is to scan the Dockerfile which is used to create the application image, as well as the Kubernetes manifests (either Helm or other). Trivy should find any misconfigurations that can result in an insecure application image or risky Kubernetes objects.&lt;/p&gt;&lt;p&gt;To understand why we should do this, let&apos;s take a look at a very simple example. Let&apos;s assume that someone made a change to the Dockerfile, resulting in an application image executing with the &lt;em&gt;root&lt;/em&gt; user. 
Our original Dockerfile had the non-privileged user &quot;my-user&quot; set,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-dockerfile&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;USER&lt;/span&gt; my-user&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;But then, someone accidentally switched the directive to,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-dockerfile&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;USER&lt;/span&gt; root&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Best practice &lt;a target=&quot;_blank&quot; href=&quot;https://sysdig.com/blog/dockerfile-best-practices/&quot;&gt;dictates&lt;/a&gt; that the application container should execute with a non-privileged user, so the change above is a bad practice which can lead to much trouble. Scanning the Dockerfile with Trivy,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy config --exit-code 1 --severity HIGH,CRITICAL ./Dockerfile&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;will result in,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;2024-01-23T12:17:19.686+0200    INFO    Misconfiguration scanning is enabled
2024-01-23T12:17:20.933+0200    INFO    Detected config files: 1

Dockerfile (dockerfile)

Tests: 19 (SUCCESSES: 18, FAILURES: 1, EXCEPTIONS: 0)
Failures: 1 (HIGH: 1, CRITICAL: 0)

HIGH: Last USER command in Dockerfile should not be &apos;root&apos;
Running containers with &apos;root&apos; user can lead to a container escape situation. It is a best practice to run containers as non-root users, which can be done by adding a &apos;USER&apos; statement to the Dockerfile.
See https://avd.aquasec.com/misconfig/ds002&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We can do the same with Kubernetes manifests. For example, running the following Trivy scan,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy config --exit-code 1 --severity HIGH,CRITICAL ./helm-charts&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;resulted in,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;HIGH: Container &apos;my-app&apos; of Deployment &apos;my-deployment&apos; should set &apos;securityContext.readOnlyRootFilesystem&apos; to true
An immutable root file system prevents applications from writing to their local disk. This can limit intrusions, as attackers will not be able to tamper with the file system or write foreign executables to disk.
See https://avd.aquasec.com/misconfig/ksv014&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;So, if Trivy finds any issues with the Dockerfile or Kubernetes manifests, the build will break.&lt;/p&gt;&lt;h3 id=&quot;heading-scanning-the-application-dependencies&quot;&gt;Scanning the Application &lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/h3&gt;&lt;p&gt;The second step is to scan the application dependencies for vulnerabilities. 
Trivy supports a large number of &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy/v0.48/docs/coverage/language/#supported-languages&quot;&gt;programming languages&lt;/a&gt;. This is done by running,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy filesystem --scanners vuln --include-dev-deps --exit-code 1 --severity HIGH,CRITICAL .&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Note the &lt;code&gt;--include-dev-deps&lt;/code&gt; flag. By default, Trivy will not scan dev dependencies (like the dependencies specified in the &lt;code&gt;devDependencies&lt;/code&gt; section of an npm &lt;code&gt;package.json&lt;/code&gt; file). For me, this is an issue. Although these dependencies will not be part of the final image, they will get installed and used as part of the testing steps, and having vulnerable dependencies running on your CI system can lead to trouble. If Trivy flags any issues with any of the dependencies, the build will stop just in time.&lt;/p&gt;&lt;h3 id=&quot;heading-scanning-the-final-application-image&quot;&gt;Scanning the Final Application Image&lt;/h3&gt;&lt;p&gt;The third step is to scan the final application image. 
This happens after all the unit tests have been executed and the Dockerfile has been used to package the application inside a base image, resulting in an application image ready to be deployed.&lt;/p&gt;&lt;p&gt;This step will find issues and vulnerabilities in a number of potentially hidden spots:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;The base image itself can have outdated OS packages with vulnerabilities that didn&apos;t exist when the base image was constructed and scanned&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The Dockerfile can install outdated OS packages on top of the base image, which will introduce OS package vulnerabilities into the final image&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The Dockerfile can install application dependencies that are beyond the scope of the code base itself, such as by executing &lt;code&gt;npm install&lt;/code&gt; or &lt;code&gt;pip install&lt;/code&gt; commands within the Dockerfile. These external dependencies may introduce vulnerabilities.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Secrets and passwords can find their way into the final application image. This could happen in a few ways: through the Dockerfile, by getting baked into the base image, or via some other route.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The scan itself is done by running,&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;trivy image --exit-code 1 --severity HIGH,CRITICAL [IMAGE_NAME]&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If Trivy spots any issues with the final image, the build will stop right there, preventing the problematic application image from sneaking into the image repository and ending up in one of the Kubernetes clusters.&lt;/p&gt;&lt;h2 id=&quot;heading-shift-right-trivy-inside-the-cluster&quot;&gt;Shift Right - Trivy Inside the Cluster&lt;/h2&gt;&lt;p&gt;Unit tests are coupled to a code version. As long as no changes are introduced to the code, the results of the tests should remain the same. 
This means that when the tests pass and the code is deployed, we don&apos;t need to rerun the unit tests to ensure the stability of the code over time.&lt;/p&gt;&lt;p&gt;Vulnerabilities are different. An application image with no vulnerabilities today may have critical ones tomorrow. Because we usually do not replace &lt;strong&gt;all&lt;/strong&gt; of our running containers inside a Kubernetes cluster on a daily basis (and sometimes not even for weeks or months), the container images used may accumulate vulnerabilities over time that we are not aware of.&lt;/p&gt;&lt;p&gt;The same goes for configuration files. New configuration best practices are published based on issues and vulnerabilities discovered. So, a configuration scan done today may miss new best practices recommended tomorrow (I assume that you never make manual configuration changes, and everything goes via a CI/CD flow).&lt;/p&gt;&lt;p&gt;We can address these issues by scanning our image repositories and configuration files either on a daily basis or whenever new vulnerabilities are discovered (AWS Inspector can &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/inspector/latest/user/scanning-ecr.html&quot;&gt;do it for us&lt;/a&gt; for images stored in AWS ECR). Subsequently, we will need to identify the Kubernetes clusters where those images and configs were deployed and replace them with updated ones.&lt;/p&gt;&lt;p&gt;Tracing and automating this solution can be complex, particularly if there are numerous image repositories with many images added each day, and lots of Kubernetes clusters with different image versions deployed. 
Additionally, this approach requires us to keep track of all third-party images used in the clusters (such as ArgoCD, Prometheus, etc.), which we do not always keep in our own image repositories.&lt;/p&gt;&lt;p&gt;A different approach involves installing a component inside the cluster that scans the cluster internals both on a regular basis and each time a change is made (for example, when a Pod spec is changed). This controller should expose metrics and reports that we can utilize for alerts and monitoring. And guess what? The &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.2/&quot;&gt;Trivy Operator&lt;/a&gt; does exactly this!&lt;/p&gt;&lt;h3 id=&quot;heading-the-trivy-operator&quot;&gt;The Trivy Operator&lt;/h3&gt;&lt;p&gt;I will not go into the details of how to deploy the Trivy Operator to the cluster. The official &quot;Getting Started&quot; &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.3/&quot;&gt;guide&lt;/a&gt; contains examples of a number of ways to install the operator.&lt;/p&gt;&lt;p&gt;The operator manages a &lt;a target=&quot;_blank&quot; href=&quot;https://kubernetes.io/docs/concepts/architecture/controller/&quot;&gt;controller&lt;/a&gt; which runs a control loop. The control loop is responsible for generating and continuously updating &lt;em&gt;security report resources&lt;/em&gt;, stored as Kubernetes &lt;a target=&quot;_blank&quot; href=&quot;https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/&quot;&gt;CRDs&lt;/a&gt; which can be accessed via the Kubernetes API. The reports are generated or updated when:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;A new resource is deployed or updated within the cluster (a new version of a Deployment, a new StatefulSet, a change in an RBAC role, etc.)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;On a recurring schedule. This is done by setting a TTL for the security reports (defaulting to 24 hours). 
When the TTL expires, the reports are deleted, which causes the controller to rerun the scans and create updated reports.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The Trivy Operator generates a number of reports, some of which are:&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706609852502/9149d536-9146-4fd1-8413-109602ae8d5a.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;As you can see, it&apos;s truly a powerful tool that provides great visibility into the various components of the Kubernetes cluster, and enables continuous security monitoring and auditing of the cluster.&lt;/p&gt;&lt;h3 id=&quot;heading-running-the-scans-jobs&quot;&gt;Running the Scan Jobs&lt;/h3&gt;&lt;p&gt;The Trivy Operator utilizes Kubernetes &lt;a target=&quot;_blank&quot; href=&quot;https://kubernetes.io/docs/concepts/workloads/controllers/job/&quot;&gt;Jobs&lt;/a&gt; to execute its scan tasks. Each time it needs to run a scan task, it creates a Job Pod. The Job Pod runs a Trivy init-container, which pulls the Trivy Database and saves it to an &lt;a target=&quot;_blank&quot; href=&quot;https://kubernetes.io/docs/concepts/storage/volumes/#emptydir&quot;&gt;emptyDir&lt;/a&gt; volume shared by all the containers of the Pod.&lt;/p&gt;&lt;p&gt;Next, if for example the Job is tasked with scanning the containers of a newly deployed application Pod, it will pull that Pod&apos;s container images (think of a Deployment Pod with an init-container, an application container, and maybe a sidecar) into the Job Pod. Then it will run the Trivy CLI against each container, using the Trivy Database in the emptyDir volume. 
Finally, the resulting logs of each Trivy CLI run will be aggregated into a security report (&lt;strong&gt;Image 3&lt;/strong&gt;).&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706719090709/5fd75c18-4861-4531-b58e-cd1c2b03f138.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Image 3. Trivy Operator uses Kubernetes Jobs for scanning containers&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;I would like to offer two tips. The first relates to the Kubernetes nodes that the scan jobs run on; the second relates to isolated environments and/or large numbers of concurrent scan jobs.&lt;/p&gt;&lt;p&gt;At first, I tried to reuse nodes that already had workloads running on them, and so tried to limit the CPU and memory usage of the scan Job Pods as much as possible. This resulted in &lt;a target=&quot;_blank&quot; href=&quot;https://en.wikipedia.org/wiki/Out_of_memory&quot;&gt;OOM&lt;/a&gt; (Out of Memory) issues.&lt;/p&gt;&lt;p&gt;To overcome this, I decided to run the scan jobs on separate nodes. As I&apos;m using EKS and &lt;a target=&quot;_blank&quot; href=&quot;https://karpenter.sh/&quot;&gt;Karpenter&lt;/a&gt; to manage most of the cluster nodes, I created a separate NodePool dedicated only to Trivy Operator Jobs. Then, I set the &lt;code&gt;scanJobNodeSelector&lt;/code&gt; in the Helm values file of the Trivy Operator to select only the nodes of this specific NodePool.&lt;/p&gt;&lt;p&gt;When no jobs are running, Karpenter keeps the size of the NodePool at zero (money saved!). Once there is a need for scan jobs, Karpenter will see the Pending Pods and will start as many nodes as needed (the NodePool is based on &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/ec2/spot/details/&quot;&gt;Spot Instances&lt;/a&gt; only, so money saved again!). Once all jobs are done, Karpenter will again scale the NodePool to zero. 
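&lt;/p&gt;&lt;p&gt;For reference, the relevant part of the Trivy Operator Helm values can look roughly like this. This is only a sketch: the NodePool name and label used here are assumptions from my setup, so check them against your own Karpenter NodePool and chart version.&lt;/p&gt;

```yaml
# Trivy Operator Helm values (sketch): send scan Jobs to a dedicated
# Karpenter NodePool. The "trivy-scan-jobs" NodePool name is hypothetical.
scanJobNodeSelector:
  karpenter.sh/nodepool: trivy-scan-jobs
```

&lt;p&gt;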
This way, I can give the scan Job Pods more memory and CPU without impacting other running workloads in the cluster.&lt;/p&gt;&lt;p&gt;The second tip relates to situations where the scan jobs need to run on isolated nodes with no direct Internet access. By default, the scan jobs run in &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/latest/docs/vulnerability-scanning/trivy/#standalone&quot;&gt;Standalone&lt;/a&gt; mode, where each job pulls the Trivy Database from GitHub. If a scan job cannot reach GitHub, it cannot start. Moreover, even when an Internet connection is available, launching a large number of concurrent scan jobs at the same time can lead to rate limit issues with GitHub.&lt;/p&gt;&lt;p&gt;To address this, we can employ Trivy in its &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/latest/docs/vulnerability-scanning/trivy/#clientserver&quot;&gt;ClientServer&lt;/a&gt; mode. In this approach, we deploy another component of the Trivy Operator called the trivy-server. The trivy-server is responsible for updating the Trivy Database on a regular basis; as such, it is the only component requiring Internet access. Furthermore, the scan Job Pods will not scan the images themselves. Instead, they will send a reference to the image to the trivy-server, which then conducts the scan and relays the findings.&lt;/p&gt;&lt;h3 id=&quot;heading-monitoring-and-alerting&quot;&gt;Monitoring and Alerting&lt;/h3&gt;&lt;p&gt;The Trivy Operator exposes the results of its scanning in the form of &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.3/tutorials/integrations/metrics/&quot;&gt;Prometheus Metrics&lt;/a&gt;. This enables visibility into the state of cluster vulnerabilities and misconfigurations. 
There is a pre-made &lt;a target=&quot;_blank&quot; href=&quot;https://aquasecurity.github.io/trivy-operator/v0.18.3/tutorials/grafana-dashboard/&quot;&gt;Grafana Dashboard&lt;/a&gt; that visualizes the metrics in a nice way (&lt;strong&gt;Image 4&lt;/strong&gt;).&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706775472709/9bafc319-19fe-4158-b70b-3e13461a9c60.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Image 4. A sample of the Trivy Operator Grafana Dashboard&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;The real power comes with the ability to set alerts based on the metrics. For example, we can set an alert that triggers each time the value of the &lt;code&gt;trivy_image_exposedsecret&lt;/code&gt; metric is greater than zero, notifying us when there is a potential exposed secret. Also, as the Operator re-scans the cluster on a daily basis, we can set alerts to trigger when there is an increase in the number of critical vulnerabilities compared to yesterday.&lt;/p&gt;&lt;p&gt;As you can see, we can craft a very effective vulnerability and misconfiguration monitoring system based on the metrics and deploy it to a number of clusters, which will provide a very good grip on security.&lt;/p&gt;&lt;p&gt;Unfortunately, as of now, the Trivy Operator lacks a frontend interface that we can use to see detailed information about the vulnerabilities and misconfigurations found. However, if you are using &lt;a target=&quot;_blank&quot; href=&quot;https://k8slens.dev/&quot;&gt;Lens&lt;/a&gt;, there is a Trivy Operator &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/aquasecurity/trivy-operator-lens-extension&quot;&gt;extension&lt;/a&gt; that you can add to Lens. 
The extension provides nice and detailed insight into the issues found and can help investigate alerts triggered by the Prometheus metrics.&lt;/p&gt;&lt;h2 id=&quot;heading-to-sum-things-up&quot;&gt;To Sum Things Up&lt;/h2&gt;&lt;p&gt;Whether running as part of our CI/CD pipelines or as a Kubernetes Operator inside our clusters, Trivy is a great enabler for implementing DevSecOps and maintaining as secure an infrastructure as possible. Shifting left with Trivy helps prevent issues from reaching our clusters; shifting right again helps us detect and mitigate issues inside our clusters.&lt;/p&gt;&lt;p&gt;Have fun!&lt;/p&gt;]]&gt;</hashnode:content><hashnode:coverImage>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/u0vgcIOQG08/upload/80baa5406b0fa21fe54945cd9f9b9c53.jpeg</hashnode:coverImage></item></channel></rss>