Hi, I'm playing with clustering and can't understand how to use the "Critical Resource Monitoring" feature. As I understand it, one enters a critical resource address (e.g. the application server IP) here. If the switch can't communicate with that address (in what manner — just ping, or something more?), the switch should de-adopt its APs, because it can't provide proper service. The manual (RG - RFS System - 4.2 (72E-132942-01A, 2009-12)), however, says:

"Enter the IP address of the Critical Resource. When the heartbeat is lost, this resource will be checked for reachability. The critical resource can be any gateway, server or host. If the critical resource is not reachable and the heartbeat is still lost, the switch will de-adopt APs and continue to de-adopt APs until instructed otherwise."

What does this have to do with the heartbeat? Why should the switch keep its APs if it has lost connection to the cluster but can still reach the critical resource? And finally, what scenario is this for? The only thing I could come up with is keeping the data flowing, but a switch that lost the heartbeat shouldn't drop its APs anyway...
Critical Resource Monitoring Feature (WiNG Cluster)
Guys, thanks for your efforts, but I still don't understand two things:

1. Why should a switch keep its APs if the CR is unreachable but the switch is still in the cluster? (as per the definition of the CR monitoring feature)
2. If the CR is not defined, why would a switch ever decide to drop APs when the cluster falls apart?

Here are my examples:

1. An N-switch active-active cluster. Each switch has an uplink connection and a separate connection/VLAN for the heartbeat. Someone unplugs a switch's uplink (or messes with the uplink VLAN). The heartbeat is still there (separate link). Per the current CR Monitoring definition, the switch will keep its APs (CR out, heartbeat present). This switch is no longer able to provide service, but it still doesn't want to give up its APs. Why? I think this is wrong behavior.
2. A two-switch cluster. The CR is not defined. The other switch dies. The heartbeat disappears. Why would the remaining switch want to un-adopt its own APs only to re-adopt them later, causing a major network disruption and reducing the whole clustering investment to zero?

If we combine the CR and HB factors, we get these outcomes (HB = heartbeat, "+" = present, "-" = absent):

CR+, HB+ = Everything's OK; continue working.
CR+, HB- = The cluster died; I don't know what happened, but I can still provide service and will do so. Keep working and expect APs from the other switches to come in for adoption.
CR-, HB+ = I've lost my ability to provide service; I hope the rest of the cluster can handle my load. Un-adopt APs. (If the CR is lost for the whole cluster, nothing would work anyway.)
CR-, HB- = Everything seems down; I don't know if it's just me or everybody else, and I can't provide service anyway. Un-adopt APs.

So, if I was correct in describing the expected behavior (please correct me if not), we clearly see that the CR and the heartbeat are totally independent factors. No CR = no service for AP clients; no heartbeat (HB) = "I'm the only one left standing, everybody else is dead".
Which is definitely not what's written in the manual. So, either I'm not getting something, or we have room for improvement here. Which is which?
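To make my proposed decision table unambiguous, here it is as a tiny pure function (a hypothetical model of the behavior I'm arguing for, not actual switch code; the function name is mine):

```python
# Hypothetical model of the decision table proposed above (not actual
# WiNG switch code): CR and heartbeat are treated as independent factors.
def should_keep_aps(cr_reachable: bool, heartbeat_present: bool) -> bool:
    """Return True if the switch should keep its adopted APs."""
    # No CR = no service for AP clients, so give up the APs regardless
    # of the cluster state; with the CR reachable, always keep working.
    return cr_reachable

# The four CR/HB combinations from the table:
assert should_keep_aps(cr_reachable=True,  heartbeat_present=True)       # CR+, HB+: keep working
assert should_keep_aps(cr_reachable=True,  heartbeat_present=False)      # CR+, HB-: keep working
assert not should_keep_aps(cr_reachable=False, heartbeat_present=True)   # CR-, HB+: un-adopt
assert not should_keep_aps(cr_reachable=False, heartbeat_present=False)  # CR-, HB-: un-adopt
```

Note the function never looks at the heartbeat at all — which is exactly my point about the two factors being independent.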
Okay, let me explain the cluster behavior in the 3.x builds; the answers to your questions are inline.
Cluster behavior in 3.x without a critical resource (CR) configured:
Topology details:-
==============
Consider SW1 as wireless switch 1 and SW2 as wireless switch 2; both are WS5100s. On a WS5100, Eth1 is considered the wireless-connectivity side and Eth2 is the RON side, i.e. the network connected to the core/backbone network.
Eth1 of SW1 and SW2 are connected to each other via multiple L2 (PoE) switches, and the APs are connected to those L2 (PoE) switches.
Eth2 of SW1 and SW2 are connected to each other via multiple L2 switches and connect to the gateway / Internet / backbone.
Redundancy is enabled via Eth2 of both switches (i.e. the VLAN mapped to the Eth2 interface).
In this topology, suppose the link to the L2 switch directly connected to Eth2 of SW1 is broken. Redundancy fails, because the heartbeat from SW2 no longer reaches SW1. But SW2 can still reach the backbone/core network via the other L2 switch directly connected to its Eth2. (Imagine there are more than two L2 switches between SW1 and SW2.) The behavior is that SW1 will not dis-adopt its adopted APs, so the MUs connected to SW1 will not be able to reach the backbone or the Internet, while in the same network SW2 still has backbone/Internet connectivity, and the MUs connected to SW2 can reach the backbone, the Internet, or a specific server on the backbone.
So the customer request was to avoid this situation (where only some MUs in the same cluster network have connectivity to the Internet, the backbone, or a server) by configuring a critical resource as an IP address.
Customer request: whenever redundancy between the switches goes down, both switches should check for the critical resource (normally an IP address on the backbone, a server, or a gateway to an external network).
Point 1: Only when redundancy is broken does the switch check for the critical resource and decide whether to un-adopt the APs (this solves the issue described above).
Point 2: Suppose redundancy is not broken, and a critical resource is configured but unreachable. The wireless switch will not check for the CR (the customer request was to check for the CR only when redundancy is broken) and will keep providing at least wireless connectivity to the MUs. This looks good.
Point 3: Suppose, in the setup above, Eth2 of SW1 goes down, so SW1 stops getting the heartbeat from SW2. Both SW1 and SW2 check for the CR. SW1 cannot reach the CR, but SW2 can. Per the customer request (i.e. the CR behavior), SW1 un-adopts all of its APs and they adopt to SW2. APs already adopted on SW2 remain adopted there, so SW2 ends up with all the APs of SW1 and SW2, and all the MUs can reach the CR / backbone network via SW2. This looks good.
Point 4: Suppose the setup is in the same state as in point 3, and now the CR itself goes down. In this case SW2 will not re-check CR availability and will not dis-adopt the APs adopted to it. This avoids a wireless network interruption. I feel this behavior is fine; the CR is unreachable via both SW1 and SW2, so there is no point in un-adopting the APs.
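Points 1–4 can be condensed into a small sketch of the implemented decision (my own hypothetical model of the behavior described here, not actual switch code): the CR is consulted only after the heartbeat is lost.

```python
# Hypothetical model of the 3.x behavior described in points 1-4
# (not actual WiNG switch code): the CR is checked only when the
# heartbeat (redundancy) is lost.
def should_unadopt_aps(heartbeat_present: bool, cr_reachable: bool) -> bool:
    """Return True if the switch should un-adopt its APs."""
    if heartbeat_present:
        # Point 2: redundancy intact -> the CR is not even checked.
        return False
    # Points 3/4: redundancy broken -> un-adopt only if the CR is unreachable.
    return not cr_reachable

assert not should_unadopt_aps(heartbeat_present=True,  cr_reachable=False)  # point 2
assert should_unadopt_aps(heartbeat_present=False, cr_reachable=False)      # point 3, SW1
assert not should_unadopt_aps(heartbeat_present=False, cr_reachable=True)   # point 3, SW2
```

Contrast this with the poster's proposed table, where CR reachability alone would drive the decision regardless of the heartbeat.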
Answering your questions:

1. Why should a switch keep its APs if the CR is unreachable, but the switch is still in the cluster? (as per the definition of the CR monitoring feature)

Ans: To keep providing wireless connectivity to the mobile units and avoid a wireless interruption. (Remember, the requirement was: only when redundancy goes down, check for the CR and decide whether or not to un-adopt; otherwise, don't disturb the wireless network.)

2. If the CR is not defined, why would a switch ever decide to drop APs when the cluster falls apart?

Ans: If a CR is not configured, adoption behaves the same as in 3.x: the switch will not check for a CR when making its decision.

Here are my examples: 1. An N-switch active-active cluster. Each switch has an uplink connection and a separate connection/VLAN for the heartbeat. Someone unplugs a switch's uplink (or messes with the uplink VLAN). The heartbeat is still there (separate link). Per the current CR Monitoring definition, the switch will keep its APs (CR out, heartbeat present). This switch is no longer able to provide service, but still doesn't want to give up its APs. Why? I think this is wrong behavior.

Ans: Use the recommended topology I have explained above and in my previous reply (redundancy and the CR should be via the same interface; in our example both are via Eth2, and AP adoption is via Eth1).

2. A two-switch cluster. The CR is not defined. The other switch dies. The heartbeat disappears. Why would the remaining switch want to un-adopt its own APs only to re-adopt them later, causing a major network disruption and reducing the whole clustering investment to zero?

Ans: If the CR feature is not enabled, the switches will not dis-adopt the APs adopted to them. Note that in 2.x the cluster/redundancy behavior is different.

Regards,
Azif
Azif, after the words "customer request" I no longer question the logic of this feature :) Now I understand that this is a custom-request-derived feature that works with a specific setup in a specific scenario. Yet shouldn't we look a step beyond what a single customer wanted, to make this feature generally useful?

I'm speaking about the CR-, HB+ scenario. Currently, as you say, the switch "just provides wireless connectivity to the MUs". But that connectivity can't get any further than the switch itself, which makes it rather useless in most scenarios. I wouldn't agree that my MUs should remain connected to a dead-end switch while other switches in the cluster can actually provide proper service! All because some customer just hadn't considered this case (the magic of verbal thinking; we're all prone to it).

Now, if we start checking for CR availability regardless of the cluster state, we would also cover this last possible case and drive the feature to its logical resolution (i.e. no CR = no service = better that someone else handle my APs). This is even more logical when the cluster is alive (big odds that other switches can reach the CR)! I understand that this means additional work, and there are different ways to do it: from a simple alteration of the logic, to adding a WCCP protocol field/message saying "I can/can't reach CR a.b.c.d" and making a switch drop APs only if some other switch can reach the same CR (because someone might configure different CRs for different switches, or the CR might not be reachable by all switches). And this requires resources that are generally scarce.

So my question now is: should I file the GRIP or not (i.e. what are the odds), and what category should it be ("Nice to have" or something else)? I must say that this little perk is very handy to include in RFPs to ward off the competition, but not in its current state (others could protest the inclusion of such a custom-tailored feature).
Alona, setting the Critical Resource feature aside, why would the switch want to drop APs when the cluster falls apart? For a two-switch cluster, for example, it might just as well mean that the _other_ switch failed.
It might, or it might not. If a critical resource is not defined, it's guesswork. And in that case we decided to give up the APs, on the chance that the other switch still has connectivity.
Hi Arsen Bandurian,
Critical resource (CR) usually means the default gateway for the wireless switch; in most cases it will be an external router connected to the RON side of the wireless switch. If you use this feature, keep in mind that redundancy should be enabled via the RON (this is the recommended configuration). And, as you said, it can also be a server IP, if that IP address is considered critical for the wireless MUs.
Working of Critical Resource feature
==========================
Whenever a switch loses connectivity to the other redundancy member, all the switches in that redundancy group immediately check for the CR. If the CR is reachable, the switch does not dis-adopt its APs. If the CR is not reachable, the switch dis-adopts its APs, on the assumption that some other switch in the redundancy group can reach the CR, so the MUs will get connectivity to the default gateway.
The critical resource is checked by sending an ARP request to the configured IP address. An ARP response means the CR is alive and reachable.
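As an illustration of how such a reachability check might be wired up (a hypothetical sketch only — the probe is injected as a callable, since a real ARP who-has/reply exchange needs raw-socket access on the switch; all names and the retry count are my assumptions):

```python
from typing import Callable

def check_critical_resource(cr_ip: str,
                            arp_probe: Callable[[str], bool],
                            retries: int = 3) -> bool:
    """Return True if the CR answered an ARP request within `retries` tries.

    `arp_probe` stands in for the switch's real ARP transmit/receive path:
    on a real system it would send an ARP who-has for `cr_ip` and report
    whether a reply arrived before a timeout.
    """
    # Retry a few times so one dropped frame doesn't trigger un-adoption.
    return any(arp_probe(cr_ip) for _ in range(retries))

# Stub probes for illustration:
alive = lambda ip: True    # CR answers ARP
dead  = lambda ip: False   # CR stays silent

assert check_critical_resource("10.0.0.1", alive)
assert not check_critical_resource("10.0.0.1", dead)
```

The retry loop is a design choice of the sketch, not documented WiNG behavior; the manual only says the resource "will be checked for reachability".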
Sample topology uses this feature
==========================
Eth1 -------------------- SW1 -------------------- Eth2
  |                                                  |
L2 Switch (PoE)                                 L2 Switch ----> Critical Resource (router)
  |                                                  |
Eth1 -------------------- SW2 -------------------- Eth2
Suppose redundancy is enabled via Eth2 of both wireless switches SW1 and SW2, and the default gateway for the wireless switches is reachable only via Eth2 of both switches.
All APs are connected via Eth1 of both wireless switches.
Suppose SW1 is acting as primary and SW2 as standby.
If Eth2 of SW1 goes down, all APs adopted to SW1 will get dis-adopted:
SW1 stops getting the heartbeat from SW2, so SW1 checks for the CR. The CR is not reachable, so SW1 dis-adopts its APs, on the assumption that some other switch in the network has reachability to the CR.
SW2 stops getting the heartbeat from SW1, so SW2 checks for the CR. The CR is reachable, so SW2 does not dis-adopt its APs, and it adopts all the APs from SW1.
If the CR is down, the administrator will get a console log, a syslog entry, or an SNMP trap message so the CR problem can be rectified.
The behavior is the same in the primary/standby case and in the primary/primary case.
If you have further questions, please email me at azif@Motorola.com.
Regards
Azif
Alona, if I understood correctly, this is a means to prevent the switch from un-adopting APs hastily when the heartbeat is lost but services can still be provided. Right? But why would the switch want to un-adopt APs in that case anyway? When the cluster falls apart (heartbeat is lost), each switch keeps its own APs anyway, right? Furthermore, why would a switch want to keep its APs if it has lost connectivity to the critical resource but is still part of the cluster? Is there a way of forwarding traffic to other cluster members or something like that (that's the only thing I can think of)? The opposite would make more sense: while still being part of the cluster but having lost the connection to the critical resource (i.e. "I'm in a cluster but can't provide the necessary connectivity; let me kick my APs so my cluster mates can take them and provide proper service"). This would work when members of a cluster are, for example, in different datacenters (or some poor guy just pulled the wrong patch cord). Please help me see this clearly.
Keep in mind that our primary goal is always to try to maintain connectivity for the MUs. Meaning: if cluster connectivity is lost but the critical resource is reachable, the switch will not give up its APs. If the cluster is lost and the critical resource is also lost, the switch will give up its APs in the hope that they will find better service on another switch. When the cluster falls apart, a switch will keep its APs only if the critical resource is still reachable. We only check the critical resource when the cluster falls apart. We don't forward traffic to other switches unless you enable mobility, in which case a switch will forward traffic for home MUs that have moved to another switch.
If the switch has not lost its connection to the critical resource, that means wireless clients can still reach the wired infrastructure, so there is no significant reason for the switch to give up its APs while the clients still have a connection to the wired servers. Only if that connection is severed should the switch give up its APs, in the hope that those APs can find a better switch that does have a connection.