OCI FortiGate HA Cluster – Reference Architecture: Code Review & Fixes

Intro

OCI Quick Start repositories on GitHub are collections of Terraform scripts and configurations provided by Oracle. They are designed to help organizations quickly deploy common infrastructure setups on OCI. Each Quick Start focuses on a specific use case or workload, simplifying provisioning on OCI with Terraform: a sort of IaC-based reference architecture.

Today, we will review the code of one of these reference architectures: a Fortinet firewall solution deployed on OCI.
Note: This article won’t discuss the architecture itself, but will rather address the flaws in its Terraform code and their fixes.

Why Some Errors Never Get to Your OCI Resource Manager Stack

Certain Terraform errors may never reach your RM stack because of how RM is designed. For instance, RM lets you select values for specific variables, like availability domains, directly in its interface. This sidesteps the conditional logic in the Terraform code that would otherwise have to resolve those values.

Moreover, RM reads these variables from the schema.yaml file, which alters the behavior compared to a local Terraform CLI run: errors that would fire locally can end up handled or bypassed within the RM environment, a real difference from standard Terraform workflows.
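To illustrate, here is a minimal sketch (the variable name matches the one discussed later in this article; the repo's actual declaration may differ). In a local run, the empty default forces the fallback logic in the code to execute, while in RM the schema.yaml-driven form pre-fills the value, so that fallback, and any bug in it, may never be exercised:

# Illustrative sketch only; the repo's actual declaration may differ
variable "availability_domain_name" {
  type        = string
  default     = "" # stays empty in a local run; RM pre-fills it via schema.yaml
  description = "Availability domain name (picked from a dropdown in the RM UI)"
}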

The Stack: FortiGate HA Cluster using DRG – Reference Architecture


The stack is the result of a collaboration between Oracle and Fortinet. The architecture is based on a hub-and-spoke topology, using FortiGate firewalls from the OCI Marketplace. I actually deployed it while working on one of my projects.

For details of the architecture, see Set up a hub-and-spoke network topology.


The Repository

You will find this Terraform configuration under the main OCI-Fortinet GitHub repository, but not in the root directory. The folder in question is drg-ha-use-case under: oracle-quickstart/oci-fortinet/use-cases/drg-ha-use-case.

The Errors

At the time of writing, the errors were still not fixed, despite my opening issues and sharing the fixes. You can see that the last commit dates back two years. You will need to clone the repo and navigate to the drg-ha-use-case subdirectory:

$ git clone https://github.com/oracle-quickstart/oci-fortinet.git
$ cd oci-fortinet/use-cases/drg-ha-use-case
$ terraform init
$ terraform plan

1. Data Source Error in Regions with a Single AD

You will face this issue in any region with only one availability domain (e.g., ca-toronto-1), as indexing into the availability domains data source will fail during the Terraform plan.

[Error screenshot: terraform plan fails with an invalid index on the availability domains data source]

CAUSE: See issue #8

  • In the above error, Terraform complains that the availability domains data source contains only one element. This impacts two of the oci_core_instance resource blocks (2 web VMs, 2 DB VMs).
    • File: compute.tf
    • Lines: 235 & 276

    Problem:

    In a single-AD region, the data source returns exactly one element, so the only valid index is 0, yet the resource block looks up index count.index + 1.

    • File: data_source.tf
    • Lines: 8-10

    This configuration clearly hasn’t been tested in single-AD regions.

    $ vi data_source.tf
    # ------ Get list of availability domains
    data "oci_identity_availability_domains" "ADs" {
      compartment_id = var.tenancy_ocid
    }

    Reason:

    In Terraform, count.index always starts at 0. If you have a resource with a count of 4, count.index will take the values 0, 1, 2, and 3.

    Let’s take, for example, the “web-vms” oci_core_instance block in compute.tf at line 235:
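The availability domain argument looks roughly like this (a paraphrased sketch based on the description that follows; the exact syntax in compute.tf may differ):

# Paraphrased sketch of the conditional in compute.tf (exact syntax may differ)
availability_domain = var.availability_domain_name == "" ? (
  data.oci_identity_availability_domains.ADs.availability_domains[count.index + 1].name
) : var.availability_domain_name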

If we run the condition:

  • The variable availability_domain_name is empty, so the fallback branch executes.
  • The ads data source contains a single element, so the lookup index evaluates to count.index + 1 = 0 + 1 = 1.

data.oci_identity_availability_domains.ADs.availability_domains[1] doesn’t exist, as the list holds only one element (at index 0).

Solution

Extend the availability domain conditional expression on lines 235 and 276 (web-vms/db-vms) to handle the case where the ads.availability_domains list has a single element (the region has only one AD).
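Here is a minimal sketch of such a fix (my own patch, assuming the expression shape shown earlier; the maintainers may format it differently):

# Fall back to the region's only AD when the data source returns a single element
availability_domain = var.availability_domain_name != "" ? var.availability_domain_name : (
  length(data.oci_identity_availability_domains.ADs.availability_domains) == 1
  ? data.oci_identity_availability_domains.ADs.availability_domains[0].name
  : data.oci_identity_availability_domains.ADs.availability_domains[count.index + 1].name
)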


Bad Logic

Seeking the name of the count.index + 1 availability domain is still wrong when the region has more than one AD. For example, say you want to create 3 VMs and your region has 2 availability domains:

  • The first iteration [0] resolves count.index + 1 = 1 (2nd data source element = AD2).
  • The second iteration [1] resolves count.index + 1 = 2, pointing at a 3rd element (AD3) that doesn’t exist.
  • The third iteration [2] resolves count.index + 1 = 3, also out of range.

The 2nd and 3rd iterations will always fail because there are only 2 ADs (index list [0,1]); notice, too, that AD1 at index 0 is never used. A more robust alternative is sketched below.
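A cleaner approach (my suggestion, not the repo's code) is to wrap the index with modulo, so VM placement round-robins across however many ADs the region actually has:

# Round-robin across the ADs that actually exist (works for 1, 2, or 3 ADs)
availability_domain = var.availability_domain_name != "" ? var.availability_domain_name : (
  data.oci_identity_availability_domains.ADs.availability_domains[
    count.index % length(data.oci_identity_availability_domains.ADs.availability_domains)
  ].name
)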

2. Wrong Compartment Argument in the Security List Data Sources

Another issue you will run into is a failure to deploy subnets, because the security list data source collection comes back empty (no elements).

[Error screenshot: the plan fails because the allow_all_security data source returned no elements]

CAUSE: See issue #9

In the above error, Terraform complains that the allow_all_security data source is empty. This impacts all FortiGate subnet blocks in the config as they all share the same security lists.

  • File: network.tf
  • Lines: 240 & more

Reason:

In this configuration, there are 2 compartments: one for compute and another for network resources. If you take a look at the allow_all_security block in data_source.tf (lines 64 to 74), you’ll notice the wrong compartment ID in the security lists data source (compute instead of network).
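For context, the buggy block looks roughly like this (paraphrased; only the compartment line is the confirmed culprit):

# data_source.tf lines 64-74 (paraphrased): the wrong compartment is searched
data "oci_core_security_lists" "allow_all_security" {
  compartment_id = var.compute_compartment_ocid # WRONG: compute instead of network
  vcn_id         = local.use_existing_network ? var.vcn_id : oci_core_vcn.hub.0.id
}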

Solution

This was a silly mistake, but it took me a day to figure out while digging through a pile of new Terraform files. All you need to do is replace the compute compartment variable with var.network_compartment_ocid.

Edit data_source.tf lines 64-74:

# ------ Get the Allow All Security Lists for Subnets in Firewall VCN
data "oci_core_security_lists" "allow_all_security" {
  compartment_id = var.network_compartment_ocid # CORRECT compartment
  vcn_id         = local.use_existing_network ? var.vcn_id : oci_core_vcn.hub.0.id
}
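The correction makes sense once you recall the compartment split: the security lists are created alongside the hub VCN in the network compartment, so pointing the data source at the compute compartment returns an empty collection, and every subnet that references it fails.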

3. More Code Inconsistencies

I wasn’t done debugging, as I found other misplaced compartment variables in some VNIC attachment data sources.

  • File: data_source.tf
  • Lines: 103-115 & 118-130

You need to replace the compartment variable in both blocks with var.compute_compartment_ocid, as sketched below.
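A corrected block would look roughly like this (a sketch: the data source label and the filter arguments are illustrative; only the compartment argument is the documented fix):

# data_source.tf (sketch): VNIC attachments live in the compute compartment
data "oci_core_vnic_attachments" "fgt_vnic_attachments" {
  compartment_id      = var.compute_compartment_ocid # the documented fix
  availability_domain = data.oci_identity_availability_domains.ADs.availability_domains[0].name
  instance_id         = oci_core_instance.fgt-vm-a.id # illustrative instance reference
}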

Conclusion & Recommendations

This type of undetected code issue is why I never trust the first deployment in Resource Manager. To avoid problems in the future, especially if you decide to migrate out of RM at some point, I suggest the following workflow:

  1. Run locally and validate any code bugs.
  2. Run on Resource Manager.
  3. Store it in a git repo (a blueprint, with versioning as it evolves).

I hope this was helpful, as the issues I opened have remained unresolved for over a year in the GitHub repo.