
Unable to deregister a service #1188

Closed
drsnyder opened this issue Aug 20, 2015 · 65 comments
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner type/bug Feature does not function as expected
Milestone

Comments

@drsnyder

I brought this to the attention of the mailing list here. @slackpad asked me to go ahead and file a bug. Below is a summary of the issue from the discussion thread.

We have services that are being orphaned and we cannot deregister them. The orphans show up under one or more of the master nodes. In our configuration the master nodes are dev-consul, dev-consul-s1, and dev-broker.

The health check of the orphaned node looks something like the following:

{
    "Node": "dev-consul",
    "CheckID": "service:discussion_8080",
    "Name": "Service 'discussion' check",
    "ServiceName": "discussion",
    "Notes": "",
    "Status": "critical",
    "ServiceID": "discussion_8080",
    "Output": ""
}

I attempted to deregister via:

user@dev-consul $ curl -X PUT -d '{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister

The node was removed but then reappears within 30-60s. As @slackpad recommended, I tried deregistering with:

user@dev-consul $ curl -v http://localhost:8500/v1/agent/service/deregister/discussion_8080
user@dev-consul $ curl -v -X PUT -d'{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister

Both commands returned status 200 OK. But that also failed. You can see the output in this gist as well as the debug logs from consul.

From the debug logs in consul we see:

Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered service 'discussion_8080'
Aug 20 16:57:45 dev-broker consul[2221]: agent: Check 'service:discussion_8080' in sync
Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered check 'service:discussion_8080'
Aug 20 16:57:45 dev-broker consul[2221]: http: Request /v1/agent/service/deregister/discussion_8080 (19.73968ms)
Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080, error: CheckID does not have associated TTL
Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080 (246.298µs)
Aug 20 16:57:47 dev-broker consul[2221]: agent: Synced service 'discussion_8080' <--- SHADY!

The annotation is from @slackpad.

It's also noteworthy that the orphans are always associated with one of the master nodes (e.g. dev-consul) and not the node (dev-mesos) that's running the service that was registered. I should also mention (it could be a coincidence) that the service (discussion) is also flapping, though from what I can tell from the debug logs for Consul on dev-mesos, everything is fine.

Our consul version:

$ consul version
Consul v0.5.2
Consul Protocol: 2 (Understands back to: 1)

Thanks!

@slackpad slackpad self-assigned this Aug 25, 2015
@drsnyder
Author

drsnyder commented Sep 8, 2015

I'm not sure if this helps with the solution to the problem, but the services can be deregistered if you deregister them on all of the servers in the cluster using the local agent at more or less the same time. See this tool for what we used to force the de-registration.

So in our case, I ran the linked tool above on the three servers in the cluster. It removed about 250 orphaned services that couldn't otherwise be deregistered.

@milosgajdos

We are seeing something equally obscure in Consul. I'm completely at a loss to understand what is going on, but it seems somewhat similar to the issue described above.

Consul version:

# consul version
Consul v0.5.2
Consul Protocol: 2 (Understands back to: 1)

One of the services registered with Consul dies, but Consul never removes the registered entry, although it briefly seems to. We figured we would use Consul's HTTP API to deregister the service. A pointless exercise, as we learnt later: even though Consul seems to think, for a while, that the record has been removed, the removed data re-appears out of the blue, and we are totally clueless as to why.

Here's the actual description:

We can curl the registered service at the beginning as expected

$ curl node1:8500/v1/catalog/service/my_service | python -mjson.tool
[
    {
        "Address": "1.2.3.4",
        "Node": "my_service",
        "ServiceAddress": "0.0.0.0",
        "ServiceID": "my_service:9042",
        "ServiceName": "my_service",
        "ServicePort": 9042,
        "ServiceTags": null
    }
]

We can query Consul over DNS and receive the expected reply (ignore the actual IP):

$ dig -p 8500 @consul_node1 my_service.service.dc1.consul +short
1.2.3.4
$

$ dig -p 8500 @consul_node2 my_service.service.dc1.consul +short
1.2.3.4
$

$ dig -p 8500 @node3 my_service.service.dc1.consul +short
1.2.3.4
$

Now we try to deregister the service. This is the JSON payload:

$ cat my_service.json
{
  "Datacenter": "dc1",
  "Node": "my_service",
  "ServiceID": "my_service:9042"
}

We PUT it to the leader in the cluster (which is node1) - this goes fine as expected on every node in the cluster:

$ curl -X PUT -d @my_service.json node1:8500/v1/catalog/deregister
true
$
$ curl node1:8500/v1/catalog/service/my_service
[]
$
$ curl node2:8500/v1/catalog/service/my_service
[]
$
$ curl node3:8500/v1/catalog/service/my_service
[]
$

Then in about a minute or so, this happens:

$ tail logs (on node1)
…
…
2015/09/30 19:21:52 [INFO] agent: Synced service 'my_service:9042'
$

Curling the service catalog indeed returns the entry:

$ curl node1:8500/v1/catalog/service/my_service | python -mjson.tool
[
    {
        "Address": "1.2.3.4",
        "Node": "my_service",
        "ServiceAddress": "0.0.0.0",
        "ServiceID": "my_service:9042",
        "ServiceName": "my_service",
        "ServicePort": 9042,
        "ServiceTags": null
    }
]

Now, can someone tell me what is going on here?

@drsnyder
Author

drsnyder commented Oct 1, 2015

I don't know the specifics of what's going on but what we have learned is that when this happens you have to deregister the service from all the consul servers. So, if you have three then you need to deregister the service from all three.

We have been using this tool to clean them up. We have plans to productize it as an orphaned-service reaper, but we aren't there yet.

@milosgajdos

Thanks, I'll check it out. Nevertheless, this is something I'd love to understand, as random data re-appearance does not fill me with confidence, if I'm entirely honest.

Bugs happen in every piece of software, but I'd love to understand the actual underlying problem so it does not surprise me at 3 AM, as Murphy's law says it will.

@volkantufekci

Hi,
I'm running a single-node Consul (v0.5.2) and have a similar issue here. I deregister a service via diplomat (a Ruby client) and Consul says on its stdout:

2015/11/03 11:33:08 [INFO] agent: Deregistered service 'vcs4'

But "vcs4" can still be observed in the HTTP API and web UI.

@volkantufekci

My issue is solved.
The problem was the message in Consul's output: it says a service is "deregistered" even if it doesn't exist. For example, I don't have a service registered with ServiceID "THIS_DOES_NOT_EXIST", but when I call

curl  http://CONSUL_AGENT_URL:8500/v1/agent/service/deregister/THIS_DOES_NOT_EXIST

Consul logs as:

2015/11/03 15:11:56 [INFO] agent: Deregistered service 'THIS_DOES_NOT_EXIST'

So, in my case I was trying to deregister with the wrong ServiceID, and Consul's output was misleading me, as it says the service is deregistered instead of warning "there is no such service with that ID"...

@thpham

thpham commented Dec 4, 2015

Hello,

I had a similar problem trying to deregister services created with registrator for Docker containers. It took me half a day to notice that the ServiceID was generated with special characters and that I had to call the API endpoint with a URL-ENCODED string! @milosgajdos83, taking your previous example, you should call the API like this:

curl -v -X PUT http://CONSUL_AGENT_URL:8500/v1/agent/service/deregister/my_service%3A9042

CONSUL_AGENT_URL should be the node hostname/ip where the agent registered the service.

Hope it helps some people :-)

@codelotus

It would appear that the error checking in the Consul HTTP API is not complete (I have not looked at the code to verify this), hence the successful response from a failed deregistration. @milosgajdos83 I was able to successfully register and deregister your service by changing the format of the JSON and by using the /v1/catalog/ endpoint.

To register a service

curl -XPUT -d @consulServiceRegister.json http://localhost:8500/v1/catalog/register

where consulServiceRegister.json is:

{
  "Address": "1.2.3.4",
  "Node": "test-node",
  "Service": {
    "ID": "my_service:9042",
    "Service": "my_service",
    "Address": "0.0.0.0",
    "Port": 9042
  }
}

To deregister a service (note the Address is required):

curl -XPUT -d @consulServiceDeRegister.json http://localhost:8500/v1/catalog/deregister

where consulServiceDeRegister.json is:

{ 
  "Datacenter": "test-dc",
  "Node": "test-node",
  "ServiceID": "my_service:9042",
  "Address": "1.2.3.4"
}  

At this point the registered service has been successfully deregistered and after 15 minutes the service has not returned:

curl http://localhost:8500/v1/catalog/services                                                   
{"consul":[]} 

@javaxplorer

The example from @codelotus works as long as you register with the catalog and not with the agent.
If you do the following call:

curl -XPUT -d @consulServiceRegisterAgent.json http://10.98.204.21:8500/v1/agent/service/register

Where consulServiceRegisterAgent.json is:

{
  "ID": "my_service:9042",
  "Name": "my_service",
  "Address": "1.2.3.4",
  "Port": 9042
}

And then do a deregister:

curl -XPUT -d @consulServiceDeRegister.json http://localhost:8500/v1/catalog/deregister

where consulServiceDeRegister.json is:

{ 
  "Datacenter": "test-dc",
  "Node": "test-node",
  "ServiceID": "my_service:9042",
  "Address": "1.2.3.4"
}  

The service will respawn in a minute or so :(

@slackpad slackpad added the type/bug Feature does not function as expected label Jan 13, 2016
@cabrinoob

Same problem here. I have zombie services which come back to life no matter what deregistering technique I use.

@peterklipfel

I'm load balancing with consul-template, and this is causing me some major headaches. Round robin load balancing to services that may or may not exist creates ridiculous, cascading networking bugs.

What I found was that the master said that one of my members had failed, but that member thought that it was still alive. I made the member leave, and then rejoin. This fixed the issue.

@slackpad
Contributor

Wanted to clarify - I think there are a few things going on here in this issue:

  1. The original problem posted by @drsnyder looks to be an issue with services registered on the Consul servers - that is an outstanding thing we need to track down.
  2. The error checking problem pointed out by @volkantufekci needs to be fixed because that adds to confusion by returning bogus success responses.
  3. The problems encountered by @milosgajdos83 and @cjhkramer look like a common source of confusion around using Consul. We need to beef up the docs on this - an explanation of this follows.

In Consul it's extremely rare to use the Catalog API directly. The Agent API (https://www.consul.io/docs/agent/http/agent.html) should almost always be used. For services running on Consul agents, the agent is the source of truth, not the catalog maintained by the servers. Periodically, the agents perform an anti-entropy sync and use the Catalog API internally to update the servers to have the correct state. This means that if you use the catalog API to deregister a service, it will disappear for a little while then the agent will put that back on the next sync. If you use the Agent API it will take care of removing the service from the catalog for you.

The call to https://www.consul.io/docs/agent/http/agent.html#agent_service_deregister should be made on the agent where the service is registered.

@ch3lo

ch3lo commented Feb 19, 2016

I had zombie services in /v1/catalog/service/... but not in /v1/agent/services. I did a "consul rejoin" on the agent related to the zombie and they disappeared. I think something odd is going on with the anti-entropy sync from agents to servers.

@doublerebel

Am being bitten by this today. Attempting to set maintenance mode on a nonexistent service correctly returns a 404. But, I can send any ID to a deregister endpoint and get a 200 OK, whether the service exists or not. I would expect any endpoint that takes an ID to return a 404 if that ID does not exist.

(I also can't seem to deregister a service with a . in the ID, despite that being a legal URL character and not needing URL encoding; EDIT: this might not be the case). Issues #1333, #1138, #1096 are related in case anyone there needs this thread.

I did notice that a successful service deregister also prints Deregistered check... to the logs. A nonexistent service has no checks. (I did make sure to do all this with the agent and not the catalog.)

Now I'm also wishing for a "deregister service" button in the UI, to solve this for me. Thanks all for your suggestions and helper examples.

@josegonzalez

Seems like the file for that service actually still exists on a box, even when issuing a deregistration to that box (testing with a single consul instance).

Removing the service file on the box and deregistering didn't appear to fix it. Neither did removing the local.snapshot on it. Removing both the local and remote snapshot did have an effect though.

@ghost

ghost commented Apr 22, 2016

Hello.
Is there any progress on this? I am having the issues described here and a lot of trouble.

@kbroughton

same. Pretty major flaw. Consul-template picks up the old service.

@lowzj

lowzj commented May 13, 2016

Hello.
Same problem here. I try to deregister some critical services from a Consul server that is stopped, but they are not deregistered correctly. Is there any progress?

@babbottscott

For configuring a client consuming a service, would service health rather than service catalog be a more appropriate option? I may be underestimating the bug here, but ISTM service catalog is prone to extraneous data (either from new services not yet ready for consumption, or decommissioned services) by design.

@alexykot

I can confirm that on version 0.6.4 I cannot reproduce this issue any more on a test setup.

I've built a small test setup with three Consul agents sitting in containers on the same node, talking to each other.

Then I've created a test service through endpoint PUT /v1/agent/service/register on one node, and confirmed it has propagated in seconds to other two agents and is available through GET /v1/catalog/services on each agent.

And when I deregistered service on the same agent it was created on with DELETE /v1/agent/service/deregister/test-service1 - it has gone away instantly from catalogs on all three nodes.

@n8gard

n8gard commented Jun 18, 2016

I just stood up the Consul UI in our environment then killed some EC2 instances which means they didn't gracefully leave the system. I see them as failed nodes in the UI--so far so good. But, when I click the Deregister button, they do go away. However, upon reloading the UI, they are there again. Have done this many times. It very well could be something wrong on my side as this is a very new environment and I'm doing this for the first time, but, it sounds exactly like this issue.

I'm on v0.6.4 on Ubuntu 14.04 LTS.

@skyrocknroll

@alexykot Actually the checks are registered by nomad in my case

@ghost

ghost commented Jun 21, 2016

@alexykot

And when I deregistered service on the same agent it was created on with DELETE /v1/agent/service/deregister/test-service1 - it has gone away instantly from catalogs on all three nodes.

Have you tried to deregister from nodes other than the one you used to register? That's when they come back. I'm not sure if it is supposed to work this way, though.

@alexykot

@webertlima
You cannot deregister a service from the agent on a different node; the service only exists on the agent you registered it with. It also exists in the catalog on all nodes, but that is not related to the agent itself. To be honest, I don't understand why there is a catalog/deregister endpoint at all; in my opinion the catalog should be a read-only service list.

@ghost

ghost commented Jun 21, 2016

@alexykot thanks for clearing that up.

@flypenguin

flypenguin commented Jun 21, 2016

I'm going crazy right now. I use Consul as a service registry (which it apparently is), but I am completely unable to deregister services. I am trying to use the deregister endpoint, and I am seeing the exact same behavior - and it seriously f*cks with my network setup.

I use consul-template to configure haproxy for services which appear and vanish. Because of my system setup I use only one central agent to register services with, and it seems I will never be able to deregister them.

This is a superbly bad situation, and I really do not understand the point of the /deregister endpoint if it can't be used; even with a read-only catalog I would assume I could remove services at some point. (What's the point of a distributed system if you have some weird logic about which nodes to use for some operations anyway?)

Update: I also tried de-registering on the node where the service runs, and still it's coming back.

I just. Don't. Get. It.

@flypenguin

I have now managed to get rid of those services by stopping all Consul instances, killing the data directory, and re-starting them. This is not the way to go, IMHO. For a single test case the de-registration now seems to work fine, for whatever reason. I am thinking of moving away from Consul as fast as possible now, because this kind of nondeterministic behavior makes it impossible to rely on it as a central infrastructure part, and Consul currently is the backbone of my service management.

I really like consul though and would be super happy if there could be a solution for this.

@slackpad
Contributor

Hi @flypenguin sorry you are having trouble. There are some issues called out in #1188 (comment), but Consul's behavior is definitely deterministic. I think you are running into problems because of this:

I use only one central agent to register services with

Consul's really not designed to run all registrations through a central set of agents. In Consul, the agent holds the information about which services are registered, and then takes responsibility for syncing that information up to the catalog maintained by the servers. If you delete a service from the catalog, the agent will put it back (which I agree is confusing and we need to document that more clearly). To remove a service you always need to remove it using the Agent API and it will remove it from the catalog for you.

If you run an agent on each node and always register/deregister using that agent for the services on that node then things should work properly (and if that node dies all of its services will eventually be reaped automatically). If you are running a small number of agents and registering everything through those, setting the addresses manually, it is easy to lose track of where a service was registered, making it hard to remove it. I'd strongly recommend against running Consul like this - it also prevents reaping as described above, and things like sessions from working properly. Consul is designed to have the agent running on each node in the cluster.

The other issue on here where you get a 200 when deleting even if a service doesn't exist adds to the confusion; we will also fix that. Sorry for the trouble - hopefully you can get things working well in your setup!

@AjitDas

AjitDas commented Mar 19, 2017

This is a horrible and very bad solution provided by the Consul experts: only being able to register and deregister from the same host where the agent is running. This defeats the whole purpose of a highly available and resilient architecture. I have the same issue: I have 5 Consul servers running behind an AWS ALB/ELB, like any other service, with Docker and AWS ECS, so that it's scalable and highly available across a number of AWS tasks, and I don't want a Consul client or agent running in every EC2 instance where my applications run. I have close to thousands of servers running in AWS ECS clusters for many applications, and this becomes unmanageable if I have to run a Consul client on each of these EC2 instances. I don't want to hardcode the Consul IP and port either. When I use the agent APIs via the ELB/ALB URL, the request lands on one of the 5 Consul instances and works fine, but a deregister through the load balancer has only a 20% chance of success, as it can go to any of the 5 Consul nodes; with frequent updates to my service deployments, too many service IDs dangle under a service name, and it creates a big headache.

Being able to register/deregister from any node is a must. I am surprised this was not thought through, and I would love to hear solutions from the big players who use thousands of EC2 instances for their applications.

@slackpad
Contributor

Hi @AjitDas, Consul provides solutions to the problems you mention if you run the agent on each node in your cluster. Consul's not designed to run just as a set of servers behind a load balancer; running it that way means you take on solving these problems yourself. When you run agents on each node and have your applications register themselves, they will sync up the catalog on the servers for you, and health checks can be performed locally on the agent, the results of which will get synced automatically as well. The agents perform checks against each other, forming an efficient failure detector which will arrange to have the catalog cleaned up in the event that a node dies and doesn't deregister itself. Applications only need to talk to their local agent, and Consul will route requests to a healthy server automatically with no load balancer. Many, many folks are successfully running Consul clusters with thousands of nodes in this fashion. Hope that helps!

@slackpad slackpad removed this from the Triaged milestone Apr 18, 2017
@slackpad slackpad added the theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner label May 25, 2017
@GreatSnoopy

@slackpad but what you say does not cover the situation where, for some reason, the node that registered the service in the first place is not available any more and cannot be made available again. For this very situation, there should be an option to forcibly deregister a service, even if you must do it on one of the few voting nodes. And yes, it defeats the purpose to have Consul as a distributed system and not be able to do some operations on more than one particular node.
It is even counter-intuitive to be able to set/unset key-values in the KV store from any node and not be able to do the same with services. Please change this behavior; it really does not make sense the way it is managed now.

@slackpad
Contributor

@GreatSnoopy if the node is indeed gone from the cluster then you have two ways to remove a stale service from one of the remaining nodes (it will get cleaned up by Consul after 72 hours if you don't do anything):

  1. Use the consul force-leave command or Agent API to immediately remove the node from the cluster, which will remove its associated services.

  2. Use the Catalog API to remove the services. If the node is gone then they will not be re-registered automatically (the agent on the node is what causes that).

@mritd

mritd commented Nov 6, 2017

Same problem here.

Consul v0.9.3


@dataviruset

dataviruset commented Nov 8, 2017

Have a similar problem here on Consul 1.0.0. A service ('problematic-service') keeps coming back so the Consul client is deregistering it all the time, but it doesn't go away and is still visible from the other nodes.

nov 09 00:15:44 myserver consul[2866]: 2017/11/09 00:15:44 [INFO] agent: Synced service 'a-service'
nov 09 00:15:44 myserver consul[2866]: agent: Synced service 'a-service'
nov 09 00:15:44 myserver consul[2866]: agent: Deregistered service 'problematic-service'
nov 09 00:15:44 myserver consul[2866]: 2017/11/09 00:15:44 [INFO] agent: Deregistered service 'problematic-service'

EDIT: I discovered that I had a node_name conflict. Fixing that, removing the serf folders, and executing force-leave on the Consul servers seems to have fixed the problem.

@alexeyknyshev

So is there a way to purge a service that was registered the wrong way (say, via the agent on another node, and I don't actually know which one)?

@webertrlz

The best way is to keep track of which agent your application used to register, and use the same one to deregister.

Upon node failure, the application should use the catalog API to deregister services.

The main problem is when a node serving both an agent and applications dies: those applications won't deregister, and one must have another application take care of health checking plus deregistering via the catalog API.

@slackpad
Contributor

best way is to keep track of what agent was used to register on your application, and use the same to deregister.

This is true - the local agent where it was registered using configs or /v1/agent APIs should be used to deregister it.

the main problem is when a node serving both agent & applications die, then such applications won't deregister, and one must have another application to take care of healthcheck + deregistering using the catalog api

This should not be necessary. The serfHealth check for the dead node will fail within a few seconds, effectively marking all of the services there offline. Consul will automatically clean up the catalog in 72 hours if the node doesn't come back.

@webertrlz

The serfHealth check for the dead node will fail within a few seconds, effectively marking all of the services there offline.

I don't know why that doesn't work for me. Maybe it's because I don't use service health checks on my registered services? (I don't use them because services start in an isolated network, so Consul agents can't reach them to check health.)

@mritd

mritd commented Jan 24, 2018

I wrote a little tool to clear failed services. It clears failed services on every agent (I assume that each node is an agent), and it works well in my cluster 😀

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {

	// DefaultConfig reads CONSUL_HTTP_ADDR (e.g. "172.16.0.18:8500") if set.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Panicln("Init client failed:", err)
	}

	allNodes, _, err := client.Catalog().Nodes(nil)
	if err != nil {
		log.Panicln("Query all known nodes failed:", err)
	}

	// Build one client per node, since /v1/agent endpoints are local to each agent.
	allClients := map[string]*api.Client{}
	for _, node := range allNodes {
		tmpConfig := api.DefaultConfig()
		tmpConfig.Address = node.Address + ":8500"
		tmpClient, err := api.NewClient(tmpConfig)
		if err != nil {
			log.Println("Client:", tmpConfig.Address, "creation failed!")
		} else {
			allClients[tmpConfig.Address] = tmpClient
		}
	}

	for address, tmpClient := range allClients {

		allChecks, err := tmpClient.Agent().Checks()
		if err != nil {
			log.Println("Get registered checks failed:", address)
			continue
		}

		log.Println("Clean ===>", address)

		// Deregister every service with a critical check, via its own agent.
		for _, v := range allChecks {
			if v.Status == "critical" {
				log.Println("Deregister ==>", v.ServiceID)
				if err := tmpClient.Agent().ServiceDeregister(v.ServiceID); err != nil {
					log.Println("Deregister failed:", v.ServiceID, err)
				}
			}
		}
	}
}

@slackpad slackpad added this to the Unplanned milestone Feb 2, 2018
@marshell0
Copy link

marshell0 commented Feb 8, 2018

Don't use the catalog API; use the agent API instead. The reason is that the catalog is maintained by the agents, so an entry will be re-synced by its agent even if you remove it from the catalog. A shell script to remove zombie services:

leader="$(curl -s http://ONE-OF-YOUR-CLUSTER:8500/v1/status/leader | sed 's/:8300//' | sed 's/"//g')"
while :
do
  serviceID="$(curl -s http://$leader:8500/v1/health/state/critical | jq -r '.[0].ServiceID')"
  node="$(curl -s http://$leader:8500/v1/health/state/critical | jq -r '.[0].Node')"
  echo "serviceID=$serviceID, node=$node"
  size=${#serviceID}
  echo "size=$size"
  if [ $size -ge 7 ]; then
     curl --request PUT http://$node:8500/v1/agent/service/deregister/$serviceID
  else
     break
  fi
done
curl -s http://$leader:8500/v1/health/state/critical

The JSON parser jq is used for field retrieval.

@weiwei04
Contributor

weiwei04 commented Feb 9, 2018

Hello, @slackpad I understand this

In Consul, the agent holds the information about which services are registered, and then takes responsibility for syncing that information up to the catalog maintained by the servers. If you delete a service from the catalog, the agent will put it back (which I agree is confusing and we need to document that more clearly).

anti-entropy mechanism. I am just curious about the v1/catalog/register and v1/catalog/deregister API use case, since they can't really register/deregister a service on a real agent. Maybe to register/deregister services on a fake node?

@slackpad
Contributor

slackpad commented Feb 9, 2018

@weiwei04

anti-entropy mechanism, I am just curious about v1/catalog/register and v1/catalog/deregister api use case, since it can't really register/deregister a service on a real agent. May be register/deregister service on a fake node?

Exactly! Please see https://www.consul.io/docs/guides/external.html

@webertrlz

The Catalog API is useful in my case, where services are still registered after a node failure. I can use it to deregister them from the catalog, since the node won't be coming back.

@stuartmclean

stuartmclean commented Feb 26, 2018

With our Consul installation, there is a JSON config file on every server running the service that needs to be deleted (and Consul restarted); otherwise the service will reappear after a few seconds. This file lives in a directory specified to Consul with the -config-dir flag.

@kyhavlov
Contributor

There have been a lot of fixes to anti-entropy in the last year, and this shouldn't be happening any more. If anyone is still seeing this, we can re-open this issue.

@qiangmzsx

I use Consul 1.0.6 and call Client.Catalog().Deregister() via the API; the deleted service reappears after 30s. @kyhavlov

@nfirvine

I think this design is starting to make sense to me. The only real way to get rid of a node that left ungracefully is to start it back up and get it to leave the cluster gracefully.

That said, it's pretty easy to masquerade as the dead node. Just start an agent with the same node name and ID. Then leave gracefully.

That said, I'm still trying to track down why I'm seeing all these zombie nodes in the first place.

@rene-m-hernandez

On 1.0.2 and experiencing this "issue".

@shantanugadgil
Contributor

I use HashiUI to force clean such leftover services and health checks.

@patricksuo

quoting @slackpad 's comment:

In Consul it's extremely rare to use the Catalog API directly. The Agent API (https://www.consul.io/docs/agent/http/agent.html) should almost always be used. For services running on Consul agents, the agent is the source of truth, not the catalog maintained by the servers. Periodically, the agents perform an anti-entropy sync and use the Catalog API internally to update the servers to have the correct state. This means that if you use the catalog API to deregister a service, it will disappear for a little while then the agent will put that back on the next sync. If you use the Agent API it will take care of removing the service from the catalog for you.

Maybe it's working as intended.

@kyhavlov
Contributor

Deregistering a service/node via the Catalog api and having that entry put back by the node is intended - the agent is supposed to sync that info back to the catalog, so the correct way to deregister a service is to use the /v1/agent/service/deregister endpoint which will cause the agent to trigger an update to the catalog and remove the service on its own.

@mritd

mritd commented Aug 23, 2018

The reason I hit this problem:

Deregistered with a client of another agent

In this case the deregistered service will be re-synchronized, so I created a small tool (https://github.com/Gozap/cclean) to clean up unhealthy services.
