I’m working with a client condemned to, pardon, graced with the privilege of using the Azure Cloud platform. Whilst I have years of experience working with AWS and GCP, my last exposure to Azure goes back to 2017. Memories from that time before Corona are faint; the world was a different place back then. But some things do not change.

On Managed Identities

…without having to manage any credentials? The sound of it is pleasant to the ears, and off we go taking this approach to set up an Azure Database for PostgreSQL - Flexible Server. The name is long, the process a bit involved, but in the end the identity is there, managed.
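For reference, the relevant part of the Terraform setup looks roughly like this - a sketch only, with illustrative names and sizes, and with the application’s user-assigned identity (azurerm_user_assigned_identity.app) assumed to be defined elsewhere:

data "azurerm_client_config" "current" {}

resource "azurerm_postgresql_flexible_server" "db" {
  name                = "flexible-server"        # illustrative
  resource_group_name = var.resource_group_name
  location            = var.location
  sku_name            = "B_Standard_B1ms"
  version             = "15"
  storage_mb          = 32768

  # Azure AD authentication only, no password authentication.
  # This detail comes back to haunt us later.
  authentication {
    active_directory_auth_enabled = true
    password_auth_enabled         = false
    tenant_id                     = data.azurerm_client_config.current.tenant_id
  }
}

# The application's managed identity becomes the Azure AD admin of the server.
resource "azurerm_postgresql_flexible_server_active_directory_administrator" "admin" {
  server_name         = azurerm_postgresql_flexible_server.db.name
  resource_group_name = var.resource_group_name
  tenant_id           = data.azurerm_client_config.current.tenant_id
  object_id           = azurerm_user_assigned_identity.app.principal_id
  principal_name      = azurerm_user_assigned_identity.app.name
  principal_type      = "ServicePrincipal"
}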

The application cries for a database schema, alas: “Azure AD Admin unable to run create schema”, we learn. A database user must be provided.

The application is modern, it follows trends, it is deployed on Azure Container Apps, hence it shall be given an Init Container. Simple, small, elegant. The Init Container has only one purpose in its short life: create a database user for the application and then exit, zero.

And then it exists, but it exits one. And yet the incantation looks the part:

export PGPASSWORD=$(curl -s "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fossrdbms-aad.database.windows.net&client_id=$CLIENT_ID" -H Metadata:true | jq -r .access_token)

Nothing is to be done; no variation of the incantation has any effect. Only later do we learn that Init Containers and Managed Identities do not yet coexist in the same space.

The application still cries for a schema. It is now packaged with a wrapper script that creates the database user for the application. Functionality is more important than form.

But lack of form does not imply functionality, and pg_hba.conf rejects connection for host 10.0.0.52.

There is nothing to do but wonder, and after enough wondering we learn what others have learned before us: “Under the authentication for the database you had allowed Azure active directory authentication only and that was the reason the standalone users (Postgres authentication) were not working.”

Ouroboros smiles; we turn on authentication with user credentials. But had we not set out to connect to the database without having to manage any credentials?

On Front Door and Container Apps

Front Door is a CDN that also provides security capabilities through its Web Application Firewall.

Container Apps allow you to deploy containers without the usual rain dance of configuration overhead and management tasks required when using bare-bones Kubernetes.

I am of the utmost conviction that these two are meant for each other. Fully-managed containers exposed to the world - it is a match made in heaven.

In order for Front Door to talk to Container Apps, I need to provision a Private Link Service. This allows me to create a connection (a Private Link) between the Front Door origin group and… well, a load balancer in Container Apps. This is where things start to get a bit less clear. The origin group should be pointing to a load balancer in Container Apps, but I have not created one, and there’s no indication anywhere in the Azure Portal or its documentation of what this load balancer should be.

Luckily, there’s a guide for Integrating Azure Front Door WAF with Azure Container Apps. In the comment section of this article, Chris from Microsoft writes that “Yes, when you create an internal ACA environment, it always creates a load balancer with the name ‘kubernetes-internal’.”

I say the name out loud: kubernetes-internal. Kubernetes… internal. It sounds private, hidden. This is ancient knowledge, something mere mortals like me should not know about.

It haunts me in my dreams. I re-read the architecture documentation of Container Apps; there is no mention of It-That-Must-Not-Be-Named. The only piece of knowledge I find is about the resource group this mysterious load balancer is created in: “The name of the resource group created in the Azure subscription where your environment is hosted is prefixed with MC_ by default, and the resource group name can’t be customized when you create a container app”.

The feeling of unease is only growing stronger. Is this really The Way? Am I not risking awakening The Spirits by accessing such hidden resources? Aren’t internal things prone to change? Could my infrastructure code stop working overnight because Todd at Microsoft woke up one day and decided that today was a good day to rename the load balancer to “mighty-internal-unicorn”?

My mind is full of questions, doubts and hesitation. But I am not alone. A self-help group has formed around issue 402 of the Container Apps GitHub repository. Since September 13, 2022, engineers from all around the world have been debating whether this is really The Right Way. If we’re allowed to do this. If this is officially supported by Microsoft. Mike writes: “No business is going to deploy something like this to production for 24/7 business critical applications if the workarounds suggested are not recommended by Microsoft Azure.”

But the client waits. Something needs to happen. There seems to be no other way.

At last, trembling, I type in the forbidden runes:

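# The Container Apps environment creates a load balancer under the hood,
# always named "kubernetes-internal"; find it via the generic resources
# data source, filtered on the environment's tag.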
data "azurerm_resources" "load-balancer-resource" {
  type = "Microsoft.Network/loadBalancers"

  depends_on = [azurerm_container_app_environment.container-app-env]

  name = "kubernetes-internal"
  required_tags = {
    "managed environment name" = azurerm_container_app_environment.container-app-env[0].name
  }
}

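# Resolve the full load balancer object (and thus its ID) from the lookup above.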
data "azurerm_lb" "internal-load-balancer" {
  name                = data.azurerm_resources.load-balancer-resource.resources[0].name
  resource_group_name = data.azurerm_resources.load-balancer-resource.resources[0].resource_group_name
}

With the internal-load-balancer ID in hand, I proceed to create the private link service. It works.
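For the record, the wiring looks roughly like this - again a sketch, with illustrative names and with the subnet for the private link service assumed to exist in the Container Apps virtual network:

resource "azurerm_private_link_service" "frontdoor-origin" {
  name                = "pls-container-apps"      # illustrative
  location            = var.location
  resource_group_name = var.resource_group_name

  # Attach the private link service to the frontend of the hidden
  # "kubernetes-internal" load balancer looked up above.
  load_balancer_frontend_ip_configuration_ids = [
    data.azurerm_lb.internal-load-balancer.frontend_ip_configuration[0].id
  ]

  nat_ip_configuration {
    name      = "primary"
    primary   = true
    subnet_id = var.private_link_subnet_id        # illustrative
  }
}

The Front Door origin then points its private link configuration at this service, and the pending connection gets approved on the private link service side.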

I wait. Nothing else happens. A few hours later, the service is still reachable. The work is done, but there is no sense of relief.

The doubts remain.

Container Apps and Key Vault secrets

We keep secrets in Key Vault, because that is the natural place for them. Applications deployed on Container Apps need secrets. To pass secret values to a container app, you need to define a secret in the container app (not the same as a Key Vault secret) and then create an environment variable that references this secret.
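In Terraform terms, the indirection looks roughly like this - a sketch with illustrative names, assuming the Container Apps environment and the variables already exist:

resource "azurerm_container_app" "app" {
  name                         = "my-app"                                           # illustrative
  resource_group_name          = var.resource_group_name
  container_app_environment_id = azurerm_container_app_environment.container-app-env[0].id
  revision_mode                = "Single"

  # The container app secret: a name/value pair stored on the app itself.
  secret {
    name  = "db-password"
    value = var.db_password
  }

  template {
    container {
      name   = "app"
      image  = "example.azurecr.io/app:latest"
      cpu    = 0.25
      memory = "0.5Gi"

      # The environment variable does not carry the value; it references
      # the container app secret by name.
      env {
        name        = "DB_PASSWORD"
        secret_name = "db-password"
      }
    }
  }
}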

Up until recently, it was not possible to reference secrets from a Key Vault. So the workaround consisted in reading the secret value from Key Vault and creating a new secret in the container app. And in not forgetting to repeat the process whenever the underlying value changes.

So when the feature trickled down into the terraform provider for Azure, I happily set out to adopt it. But my happiness soon turned into rage and despair, for the update had created a rift in the fabric of Azure: the container app was now in a limbo state which took some time to notice.
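The change itself is small: in the sketch above, the secret block no longer carries a value, it points at the Key Vault secret together with an identity that is allowed to read it (again illustrative, assuming a user-assigned identity with read access to the vault):

  secret {
    name                = "db-password"
    identity            = azurerm_user_assigned_identity.app.id     # assumed to exist
    key_vault_secret_id = azurerm_key_vault_secret.db-password.id   # assumed to exist
  }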

In the Azure Portal, the application appeared as healthy. It was still possible to submit modifications or update the container image (via the Portal, terraform or the API) which also seemed to succeed (no error returned). However, none of the modifications took place. Worse yet, when retrieving the application state via the CLI (az containerapp show --name in-limbo --resource-group in-limbo) the modifications were reflected in the response, but not actually applied (and not appearing in the Azure Portal). To say that this is not good is an understatement.

Deleting the container app and attempting to re-provision it yielded the following error message:

Container App Name: "hope-dies-last"): polling after CreateOrUpdate: Code="ContainerAppOperationError" Message="Failed to provision revision for container app 'hope-dies-last'. Error details: Operation expired."

As clearly explained in the Microsoft Azure documentation, pardon, as Stackoverflow user Silicium reports from the battlefield, this is related to using the reference to a Key Vault secret:

I have the same issue and implemented a workaround

In our case, when using “passwordSecretRef” which points to a “keyVaultUrl” The creation fails on the first but the second run (update) works.

We implemented a workaround which replaces the “keyVaultUrl” by a regular plaintext dummy password and on a following run replacing this by the actual keyvault reference.

Unfortunately - at least on the container-apps-jobs endpoint - azure recently implemented a password validation check (within the same API version rolleyes) so the dummy password is not accepted anymore here.

2 Months ago a already opened a ticket on azure regarding this ticket but they asked me questions which have nothing to do with the issue and asked for enabling debugging flags in the API call which are not available on that specific endpoint. The ticket ended up closing.

Now since our workaround does not work around anymore, i’ve opened a new ticket with more details how to reproduce by a simple API call to not to disturb them by some middle-ware :)

However, they did not answered on that support request since days.

Back to copy-pasting passwords, it seems.

Container Apps, registries and managed identities

Managed identities, again. I should have known.

It is possible to configure password-less access to an Azure Container Registry by using a managed identity. The registry configuration then looks like this:

registry {
  server   = var.acr_login_server
  identity = var.acr_user_assigned_identity_id
}

Clean. Neat.
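For completeness, and as far as I understand it, the same identity also needs to be attached to the container app itself and granted pull rights on the registry - roughly like this, with illustrative variable names:

# Inside the azurerm_container_app resource: attach the user-assigned identity.
identity {
  type         = "UserAssigned"
  identity_ids = [var.acr_user_assigned_identity_id]
}

# And grant that identity pull access on the registry.
resource "azurerm_role_assignment" "acr-pull" {
  scope                = var.acr_id
  role_definition_name = "AcrPull"
  principal_id         = var.acr_user_assigned_identity_principal_id
}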

And it also works when applying the change to an existing Container App.

When trying to re-provision the same container app from scratch, the following error appears:

Container App Name: "i-should-have-known-better"): polling after CreateOrUpdate: Code="ContainerAppOperationError" Message="Failed to provision revision for container app 'i-should-have-known-better'. Error details: Operation expired."

Familiar, isn’t it?

It turns out that this has been broken for over a year.

Closing thoughts

I started writing this article as I started the project, with the initial intent to write a fun piece on dealing with Azure. Working with the two other main cloud providers can be irritating at times too, but in comparison, the frustration and the time lost due to issues with the cloud provider itself were orders of magnitude greater here.

As the project progressed and the number of bugs I discovered grew (there is much more to report than what I covered here), I kept pressing on - until the issue with the container app stuck in a limbo state. To me, having the API report one thing and the Portal another is a complete red flag and a big no-go. It destroys the trust that the infrastructure runs reliably, which is at the core of what a cloud provider is all about. You can’t build a reliable application infrastructure on a shaky foundation.

At this point I had no choice but to advise the client to choose a different vendor to continue the project (a switch at this stage is not too big an effort, especially when factoring in all the time lost discovering, analysing and working around Azure bugs, not even counting the risk of running production applications under such conditions). I’m aware that many of the issues above are related to Container Apps, and I hope, for the sake of all the poor souls stuck with Azure, that other parts of the platform are not in the same state.

Taking a step back, it seems that the current process by which Azure is growing its business is the following:

  • announce a new feature and provide a half-written, bug-riddled implementation of it that misses essential use cases; at the same time, advertise the feature in sales discussions as a capability that Azure has (to be on par with the competition)
  • provide only high-level documentation of the feature, leaving it up to motivated community members to write articles on how to get it to work
  • leave it up to the users to discover bug after bug and to report them

Not cool, Microsoft.