Unable to deregister a service #1188
I'm not sure if this helps with the solution to the problem, but the services can be deregistered if you deregister them on all of the servers in the cluster via the local agent at more or less the same time. See this tool for what we used to force the deregistration. In our case, I ran the linked tool above on the three servers in the cluster; it removed about 250 orphaned services that couldn't otherwise be deregistered. |
We are seeing something equally obscure in Consul. I'm completely at a loss to understand what is going on, but it seems somewhat similar to the issue described above. Consul version:
One of the services registered with Consul dies, but Consul never removes the registered entry, although it briefly seems to. We figured we would use Consul's HTTP API to deregister the service. A pointless exercise, as we learned later. Even though Consul seems to think, for a short while, that the record has been removed, after a bit of time the removed data reappears out of the blue, and we are totally clueless as to why. Here's the actual description: we can curl the registered service at the beginning as expected
We can query Consul and receive the reply one would expect (ignore the actual IP):
Now we try to deregister the service. This is the JSON payload:
We
Then in about a minute or so, this happens:
Curling service catalog indeed returns the entry:
Now, can someone tell me what is going on here? |
I don't know the specifics of what's going on, but what we have learned is that when this happens you have to deregister the service from all the Consul servers. So, if you have three, then you need to deregister the service from all three. We have been using this tool to clean them up. We have plans to productize it as an orphaned-service reaper, but we aren't there yet. |
Thanks, I'll check it out. Nevertheless, this is something I'd love to understand, as random data reappearance does not fill me with confidence, if I'm entirely honest. Bugs happen in every piece of software, but I'd love to understand the actual underlying problem so it does not surprise me at 3 AM, as Murphy's law dictates it will. |
Hi,
But "vcs4" can still be observed in http api and web ui. |
My issue is solved.
Consul logs as:
So, in my case I was trying to deregister with a wrong ServiceID, and Consul's output was misleading me, as it said the service was deregistered instead of warning me that there is no such service with that ID... |
Hello, I got a similar problem trying to deregister services created with registrator for a Docker container. It took me half a day to notice that the ServiceID was generated with special characters and I had to call the API endpoint with an
Hope it will help some people :-) |
It would appear as though the error checking in the Consul HTTP API is not complete (I have not looked at the code to verify this), hence the successful response from a failed deregistration. @milosgajdos83 I was able to successfully register and deregister your service by changing the format of the JSON and by using the
To register a service
where consulServiceRegister.json is:
To deregister a service (note the Address is required):
where consulServiceDeRegister.json is:
At this point the registered service has been successfully deregistered and after 15 minutes the service has not returned:
|
The example from @codelotus works as long as you register with the catalog and not with the agent.
Where consulServiceRegisterAgent.json is:
And then do a deregister:
where consulServiceDeRegister.json is:
The service will respawn in a minute or so :( |
Same problem here. I have zombie services which come back to life whatever deregistration technique I use. |
I'm load balancing with consul-template, and this is causing me some major headaches. Round robin load balancing to services that may or may not exist creates ridiculous, cascading networking bugs. What I found was that the master said that one of my members had failed, but that member thought that it was still alive. I made the member leave, and then rejoin. This fixed the issue. |
Wanted to clarify - I think there are a few things going on here in this issue:
In Consul it's extremely rare to use the Catalog API directly. The Agent API (https://www.consul.io/docs/agent/http/agent.html) should almost always be used. For services running on Consul agents, the agent is the source of truth, not the catalog maintained by the servers. Periodically, the agents perform an anti-entropy sync and use the Catalog API internally to update the servers to have the correct state. This means that if you use the catalog API to deregister a service, it will disappear for a little while then the agent will put that back on the next sync. If you use the Agent API it will take care of removing the service from the catalog for you. The call to https://www.consul.io/docs/agent/http/agent.html#agent_service_deregister should be made on the agent where the service is registered. |
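To make the Agent API recommendation concrete, here is a minimal sketch of building the deregister call against the local agent's HTTP endpoint (the agent address 127.0.0.1:8500 and the ServiceID web-1 are assumptions for illustration; the endpoint path is the one from the docs linked above):

```go
package main

import (
	"fmt"
	"net/http"
)

// buildDeregisterRequest builds the PUT the Agent API expects; nothing
// is sent to Consul until the caller actually executes the request.
func buildDeregisterRequest(agentAddr, serviceID string) (*http.Request, error) {
	endpoint := fmt.Sprintf("http://%s/v1/agent/service/deregister/%s", agentAddr, serviceID)
	return http.NewRequest(http.MethodPut, endpoint, nil)
}

func main() {
	req, err := buildDeregisterRequest("127.0.0.1:8500", "web-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// Against a live agent: resp, err := http.DefaultClient.Do(req)
}
```

The key point from the comment above is that this request must be sent to the agent where the service was registered, not to an arbitrary server.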
I had zombie services in /v1/catalog/service/... but not in /v1/agent/services. I did a "consul rejoin" on the agent related to the zombies and they disappeared. I think something strange is going on with the anti-entropy sync from the agents to the servers. |
Am being bitten by this today. Attempting to set maintenance mode on a nonexistent service correctly returns a 404. But I can send any ID to a deregister endpoint and get a 200 OK, whether the service exists or not. I would expect any endpoint that takes an ID to return a 404 if that ID does not exist. (I also can't seem to deregister a service with a
I did notice that a successful service deregister also prints
Now I'm also wishing for a "deregister service" button in the UI, to solve this for me. Thanks all for your suggestions and helper examples. |
Seems like the file for that service actually still exists on a box, even when issuing a deregistration to that box (testing with a single consul instance). Removing the service file on the box and deregistering didn't appear to fix it. Neither did removing the local.snapshot on it. Removing both the local and remote snapshot did have an effect though. |
Hello. |
same. Pretty major flaw. Consul-template picks up the old service. |
Hello. |
For configuring a client consuming a service, would service health rather than service catalog be a more appropriate option? I may be underestimating the bug here, but ISTM service catalog is prone to extraneous data (either from new services not yet ready for consumption, or decommissioned services) by design. |
I can confirm that on version
I've built a small test setup with three Consul agents sitting in containers on the same node, talking to each other. Then I created a test service through the endpoint
And when I deregistered the service on the same agent it was created on with |
I just stood up the Consul UI in our environment, then killed some EC2 instances, which means they didn't gracefully
I'm on v0.6.4 on Ubuntu 14.04 LTS. |
@alexykot Actually the checks are registered by nomad in my case |
Have you tried to deregister from nodes other than the one you used to register? That's when they come back. I'm not sure if it is supposed to work this way, though. |
@webertlima |
@alexykot thanks for clearing that up. |
I'm going crazy right now. I use Consul as a service registry (which it apparently is), but I am completely unable to deregister services. I am trying to use the deregister endpoint, and I am seeing the exact same behavior, and it seriously f*cks with my network setup (I use consul-template to configure haproxy for services which appear and vanish). Because of my system setup I use only one central agent to register services with, and it seems I will never be able to deregister them. This is a superbly bad situation, and I really do not understand the point of the /deregister endpoint if it can't be used; even with a read-only catalog I would assume I could remove services at some point. (What's the point of a distributed system if you have some weird logic about which nodes to use for some operations anyway?) Update: I also tried deregistering on the node where the service runs, and still it's coming back. I just. Don't. Get. It. |
I have now managed to get rid of those services by stopping all Consul instances, killing the data directory, and restarting them. This is not the way to go, IMHO. For a single test case the deregistration now seems to work fine, for whatever reason. I am thinking of moving away from Consul as fast as possible now, because this kind of nondeterministic behavior makes it impossible to rely on it as a central infrastructure part, and Consul currently is the backbone of my service management. I really like Consul though and would be super happy if there could be a solution for this. |
Hi @flypenguin sorry you are having trouble. There are some issues called out in #1188 (comment), but Consul's behavior is definitely deterministic. I think you are running into problems because of this:
Consul's really not designed to run all registrations through a central set of agents. In Consul, the agent holds the information about which services are registered, and then takes responsibility for syncing that information up to the catalog maintained by the servers. If you delete a service from the catalog, the agent will put it back (which I agree is confusing and we need to document that more clearly). To remove a service you always need to remove it using the Agent API and it will remove it from the catalog for you. If you run an agent on each node and always register/deregister using that agent for the services on that node then things should work properly (and if that node dies all of its services will eventually be reaped automatically). If you are running a small number of agents and registering everything through those, setting the addresses manually, it is easy to lose track of where a service was registered, making it hard to remove it. I'd strongly recommend against running Consul like this - it also prevents reaping as described above, and things like sessions from working properly. Consul is designed to have the agent running on each node in the cluster. The other issue on here where you get a 200 when deleting even if a service doesn't exist adds to the confusion; we will also fix that. Sorry for the trouble - hopefully you can get things working well in your setup! |
This is a horrible solution from the Consul experts: being able to register & deregister only from the same host where the agent is running. It defeats the whole purpose of a highly available & resilient architecture. I have the same issue: I run 5 Consul instances behind an AWS ALB/ELB, like any other service, with Docker on AWS ECS, so that it's scalable & highly available across a number of AWS tasks, and I don't want a Consul client or agent running in every EC2 instance where my applications run. I have close to thousands of servers running in AWS ECS clusters for many applications, and this becomes unmanageable if I have to run a Consul client in each of those EC2 instances. I don't want to hardcode the Consul IP & port either. When I use the agent APIs via the ELB/ALB URL, the request lands on one of the 5 Consul instances and works fine, but when I deregister through the load balancer, it has only a 20% chance of success, as it can go to any of the 5 Consul nodes; with frequent updates to my service deployments, too many service IDs dangle under a service name, and it creates a big headache. Being able to register/deregister from any node is a must; I am surprised this wasn't thought through, and I would love to hear solutions from the big players who use thousands of EC2 instances for their applications. |
Hi @AjitDas, Consul provides solutions to the problems you mention if you run the agent on each node in your cluster. Consul's not designed to run just as a set of servers behind a load balancer; running it that way means you take on solving these problems yourself. When you run agents on each node and have your applications register themselves, the agents will sync up the catalog on the servers for you, and health checks can be performed locally on the agent, with the results synced automatically as well. The agents perform checks against each other, forming an efficient failure detector that arranges to have the catalog cleaned up in the event that a node dies and doesn't deregister itself. Applications only need to talk to their local agent, and Consul will route requests to a healthy server automatically, with no load balancer. Many, many folks are successfully running Consul clusters with thousands of nodes in this fashion. Hope that helps! |
@slackpad but what you say does not cover the situation where, for some reason, the node that registered the service in the first place is no longer available and cannot be made available again. For this very situation, there should be an option to forcibly deregister a service, even if you must do it on one of the few voting nodes. And yes, it defeats the purpose of having Consul as a distributed system if some operations can only be done on one particular node. |
@GreatSnoopy if the node is indeed gone from the cluster then you have two ways to remove a stale service from one of the remaining nodes (it will get cleaned up by Consul after 72 hours if you don't do anything):
|
Have a similar problem here on Consul 1.0.0. A service ('problematic-service') keeps coming back: the Consul client is deregistering it all the time, but it doesn't go away and is still visible from the other nodes.
EDIT: I discovered that I had a node_name conflict. Fixing that, removing the serf folders, and executing force-leave on the Consul servers seems to have fixed the problem. |
So is there a way to purge a service that was registered the wrong way (say, via an agent on another node, and I don't know which one it actually was)? |
The best way is to keep track, in your application, of which agent was used to register, and use the same one to deregister. Upon node failure, the application should use the catalog API to deregister services. The main problem is when a node serving both an agent and applications dies; then those applications won't deregister, and one must have another application take care of health checking + deregistering using the catalog API. |
This is true - the local agent where it was registered using configs or /v1/agent APIs should be used to deregister it.
This should not be necessary. The |
I don't know why that doesn't work for me. Maybe it's because I don't use service health checks on my registered services? (I don't use them because services start in an isolated network, so Consul agents can't reach them to check health.) |
I wrote a little tool to clear failed services. It deregisters every service with a critical check via each node's agent (I assume that each node runs an agent). It works well in my cluster 😀

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig honors CONSUL_HTTP_ADDR, e.g. "172.16.0.18:8500".
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Panicln("Init client failed:", err)
	}
	// Ask the catalog for every known node, then talk to each
	// node's local agent directly.
	allNodes, _, err := client.Catalog().Nodes(nil)
	if err != nil {
		log.Panicln("Query all known nodes failed:", err)
	}
	allClients := map[string]*api.Client{}
	for _, node := range allNodes {
		tmpConfig := api.DefaultConfig()
		tmpConfig.Address = node.Address + ":8500"
		tmpClient, err := api.NewClient(tmpConfig)
		if err != nil {
			log.Println("Client:", tmpConfig.Address, "create failed!")
		} else {
			allClients[tmpConfig.Address] = tmpClient
		}
	}
	for address, tmpClient := range allClients {
		allChecks, err := tmpClient.Agent().Checks()
		if err != nil {
			log.Println("Get registered checks failed:", address)
			continue
		}
		log.Println("Clean ===>", address)
		for _, v := range allChecks {
			// Deregister the service behind every critical check,
			// on the agent that owns the registration.
			if v.Status == "critical" {
				log.Println("Deregister ==>", v.ServiceID)
				if err := tmpClient.Agent().ServiceDeregister(v.ServiceID); err != nil {
					log.Println("Deregister failed:", v.ServiceID, err)
				}
			}
		}
	}
} |
Don't use the catalog API; use the agent API instead. The reason is that the catalog is maintained by the agents, and an entry will be re-synced back by its agent even if you remove it from the catalog. Shell script to remove zombie services:
The JSON parser jq is used for field retrieval
Hello, @slackpad. I understand this anti-entropy mechanism; I am just curious about
Exactly! Please see https://www.consul.io/docs/guides/external.html |
The Catalog API is useful in my case, where the services are still registered after a node failure. I could use it to deregister them from the catalog, since the node won't be coming back. |
With our Consul installation there is a JSON config file on all servers running the service; that file needs to be deleted and Consul restarted, otherwise the service will reappear after a few seconds. This file lives in a dir specified to Consul with |
There's been a lot of fixes to anti-entropy in the last year and this shouldn't be happening any more, if anyone is still seeing this we can re-open this issue. |
I use Consul 1.0.6; calling Client.Catalog().Deregister() via the API deletes the service, but it reappears after 30s @kyhavlov |
I think this design is starting to make sense to me. The only real way to get rid of a node that left ungracefully is to start it back up and have it leave the cluster gracefully. That said, it's pretty easy to masquerade as the dead node: just start an agent with the same node name and ID, then leave gracefully. Still, I'm trying to track down why I'm seeing all these zombie nodes in the first place. |
On 1.0.2 and experiencing this "issue". |
I use HashiUI to force clean such leftover services and health checks. |
quoting @slackpad's comment:
Maybe it's working as intended. |
Deregistering a service/node via the Catalog api and having that entry put back by the node is intended - the agent is supposed to sync that info back to the catalog, so the correct way to deregister a service is to use the |
The reason I hit this problem: I deregistered with a client pointed at another agent. In this case the deregistered service gets re-synchronized, so I created a small tool (https://github.com/Gozap/cclean) to clean up unhealthy services. |
I brought this to the attention of the mailing list here. @slackpad asked me to go ahead and file a bug. Below summary of the issue from the discussion thread.
We have services that are being orphaned and we cannot deregister them. The orphans show up under one or more of the master nodes. In our configuration the master nodes are dev-consul, dev-consul-s1, and dev-broker. The health check of the orphaned node looks something like the following:
I attempted to deregister via:
The node was removed but then reappears within 30-60s. As @slackpad recommended, I tried deregistering with:
Both commands returned status 200 OK. But that also failed. You can see the output in this gist, as well as the debug logs from Consul. From the debug logs we see:
The annotation is from @slackpad.
It's also noteworthy that the orphans are always associated with one of the master nodes (e.g. dev-consul) and not the node (dev-mesos) that's running the service that was registered. I should also mention (it could be a coincidence) that the service (discussion) is also flapping, though from what I can tell from the debug logs for Consul on dev-mesos everything is fine. Our Consul version:
Thanks!