Best Practices

ScaleOut Software recommends that you adhere to the following guidelines when deploying and managing a ScaleOut StateServer (SOSS) distributed cache.

Development Considerations

Please follow these best practices when developing your SOSS distributed cache.

DEV-1. Break up large objects for best update performance.
DEV-2. Guidelines for using the deserialized client cache.
DEV-3. Set ASPX pages to use "ReadOnly" session state whenever possible.

Deployment Considerations

Please follow these best practices when deploying your SOSS distributed cache.

DEP-1. Provision sufficient physical memory.
DEP-2. Make sure that the CPU is not overloaded.
DEP-3. Use Gigabit Ethernet (or faster) to prevent network overload.
DEP-4. Use a separate back-end subnet for best security.
DEP-5. Use a single network switch for SOSS when possible.
DEP-6. Use SOSS Remote Client to handle large numbers of clients.
DEP-7. Use banks of Web farms instead of a large monolithic farm.
DEP-8. Use a VPN for GeoServer and Internet Remote Clients for best security.
DEP-9. If you use NLB, make sure SOSS is on a separate NIC.
DEP-10. Guidelines for sharing sessions in an ASP.NET server farm.
DEP-11. Run the correct version of SOSS on 64-bit systems.
DEP-12. Be sure to enable event handling's configuration parameter.
DEP-13. Be sure to configure host gateways when using the ScaleOut Remote Client Option.
DEP-14. For single-host deployments, use the loopback adapter if a network is not available or to keep hosts from discovering each other.

Management Considerations

Please follow these best practices when managing your SOSS distributed cache.

MGT-1. Track memory usage.
MGT-2. Use sequential shut down process.
MGT-3. Avoid SOSS restart as first step in recovery.
MGT-4. Avoid simultaneous management changes on multiple machines.
MGT-5. Avoid dynamically moving virtual servers running SOSS.
MGT-6. Use a rolling upgrade when upgrading minor and hot fix releases.

Full Descriptions

DEV-1. Break up large objects for best update performance.

If you are storing large, complex objects (e.g., complex datasets) over about 500 KB, heavy update access to these objects can create both substantial networking overhead and significant CPU load to serialize and deserialize the objects. In addition, updates to any portion of these objects require that the entire object be rewritten. Consider breaking large objects up into multiple, smaller objects to increase performance and to allow concurrent access to different portions of the larger objects. This will maximize overall application performance and scalability.
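
As a rough sketch of this idea, the Python below splits one large record into per-section entries so that an update serializes and rewrites only the section that changed. The plain dict stands in for the distributed cache, and the key scheme and helper names (part_key, store_in_parts, update_part, load_part) are illustrative assumptions, not part of the SOSS API.

```python
import pickle

def part_key(object_id, section):
    """Build the cache key for one section of a larger logical object."""
    return f"{object_id}:{section}"

def store_in_parts(cache, object_id, large_object):
    """Store each top-level section of large_object under its own key."""
    for section, value in large_object.items():
        cache[part_key(object_id, section)] = pickle.dumps(value)

def update_part(cache, object_id, section, value):
    """Rewrite a single section and return the number of bytes serialized."""
    payload = pickle.dumps(value)
    cache[part_key(object_id, section)] = payload
    return len(payload)

def load_part(cache, object_id, section):
    """Read back and deserialize a single section."""
    return pickle.loads(cache[part_key(object_id, section)])

# A stand-in for the distributed cache; a real client library would replace this dict.
cache = {}
order = {
    "header": {"customer": "c-17", "currency": "USD"},
    "lines": [{"sku": n, "qty": 1} for n in range(5000)],
}
store_in_parts(cache, "order-42", order)

# Updating the small header no longer requires reserializing the 5000 order lines.
whole_size = len(pickle.dumps(order))
header_size = update_part(cache, "order-42", "header",
                          {"customer": "c-17", "currency": "EUR"})
```

Where to split depends on how the application accesses the data; sections that are always read and written together are usually better left in one object.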


DEV-2. Guidelines for using the deserialized client cache.

The deserialized client cache, which is enabled by default, is a collection of object references retained by the SOSS client library to accelerate multiple retrieve accesses to an object stored in the SOSS service. Changes made by client code to an object held in the deserialized cache may cause the deserialized cache to return the changed object on a subsequent retrieve instead of the version that is stored in the SOSS service.

The deserialized cache is only intended to be used with either read-only access or in a locked retrieve/update usage pattern. In the retrieve/update usage pattern, an object stored in the SOSS service is retrieved and locked, optionally updated, and then unlocked (either by the update or by an explicit unlock access). This pattern enables multiple processes on the same or different servers to reliably update a shared object. No changes should be made to the retrieved object outside of the retrieve/update pair because these changes would not be propagated to the distributed cache.

Sometimes a client application may want to retrieve an object from SOSS, make some changes to the object, and then discard it after using it without persisting the changes to the SOSS service. This usage model can cause the deserialized cache to get out of sync with SOSS. If an application requires that changes be made to a retrieved object outside of the above retrieve/update usage pattern, steps should be taken to avoid corrupting the deserialized cache. Either the cache should be disabled by setting the AllowDeserializedCaching property to false, or a deep copy of the retrieved object should be made with changes made only to this copy. Doing this will keep the locally cached object from getting out of sync with the authoritative object that is kept in the SOSS service.
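
The deep-copy approach can be illustrated with a minimal Python sketch. FakeClientCache is a hypothetical stand-in that returns the same object reference on every retrieve, which is the behavior that makes in-place edits risky; it is not the SOSS client library.

```python
import copy

class FakeClientCache:
    """Stand-in for a client library that hands back cached object references."""

    def __init__(self):
        self._deserialized = {}

    def put(self, key, value):
        self._deserialized[key] = value

    def retrieve(self, key):
        # Returns the same reference on every call, like a deserialized cache hit.
        return self._deserialized[key]

cache = FakeClientCache()
cache.put("cart", {"items": ["book"]})

# Editing the retrieved reference directly would change what later reads return.
# Instead, edit a deep copy and leave the cached reference untouched.
scratch = copy.deepcopy(cache.retrieve("cart"))
scratch["items"].append("pen")
```

A shallow copy would not be enough here, because the nested list would still be shared with the cached object.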


DEV-3. Set ASPX pages to use "ReadOnly" session state whenever possible.

When using session state in your ASP.NET application, setting the @ Page directive's EnableSessionState attribute to ReadOnly wherever possible can offer significant performance benefits, regardless of whether you're using ScaleOut StateServer's session state provider or one of ASP.NET's built-in providers. The default value for the EnableSessionState attribute is true, which means that the entire session dictionary is read from the backing store at the beginning of every Web request and then updated at the end of each request, regardless of whether you changed any values in the session dictionary. If you know that a page will never update session state, set the attribute to ReadOnly to prevent the updates from occurring. This will cut the number of round trips to the backing store in half and eliminate unnecessary updates.

Reducing unnecessary updates is useful when using any session provider, but it is especially effective when using ScaleOut StateServer's provider because of ScaleOut StateServer's client-side caching feature. If the session data isn't constantly being updated in ScaleOut StateServer's distributed cache, then the client-side cache will keep the most recently accessed session data inside of your ASP.NET process. This gives you a substantial performance benefit by speeding up accesses and reducing network and deserialization overhead when sessions are being read from the distributed cache.
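
One way to find candidate pages is to scan the ASPX sources for the @ Page directive and list the pages that still use the read/write default. The Python below is a small sketch of such a scan; the page names and contents are made-up samples, and it assumes the page-level default of true is not overridden elsewhere in the application's configuration.

```python
import re

PAGE_DIRECTIVE = re.compile(r"<%@\s*Page\b(.*?)%>", re.IGNORECASE | re.DOTALL)
SESSION_ATTR = re.compile(r'EnableSessionState\s*=\s*"([^"]*)"', re.IGNORECASE)

def session_state_mode(aspx_source):
    """Return the page's EnableSessionState value in lowercase, or "true" if unset."""
    directive = PAGE_DIRECTIVE.search(aspx_source)
    if directive is None:
        return "true"
    attr = SESSION_ATTR.search(directive.group(1))
    return attr.group(1).lower() if attr else "true"

# Sample page sources; in practice these would be read from the .aspx files.
pages = {
    "Catalog.aspx": '<%@ Page Language="C#" EnableSessionState="ReadOnly" %>',
    "Checkout.aspx": '<%@ Page Language="C#" %>',
}

# Pages still using read/write session state are candidates for review.
read_write_pages = [name for name, source in pages.items()
                    if session_state_mode(source) == "true"]
```

Only pages that never write to the session should be switched to ReadOnly, so the resulting list is a starting point for review rather than a list of changes to apply blindly.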


DEP-1. Provision sufficient physical memory.

To maintain high performance and stability, it is important that the SOSS service always runs in physical memory. Otherwise, the service will start paging to disk, which will cause the distributed cache to become unresponsive. SOSS's memory requirements for each host in the distributed cache include the following different components:

  • memory required for the host's portion of stored objects
  • memory required for object replicas
  • extra memory to handle dynamic load-balancing
  • extra memory to handle recovery from host failures
  • extra memory for other SOSS data structures

The formula to provision memory for objects and replicas is as follows. First, determine a maximum expected storage requirement for the distributed cache by multiplying the average object size by the maximum number of objects expected to be stored at any one time; call that number M bytes. The total memory required for object storage in the SOSS cache will then be (R+1)*M, where R is the number of replicas per object (1 or 2) and the additional copy is the primary object itself. The memory needed per host is then (R+1)*M/N, where N is the number of hosts in the server farm. Note that the memory required per host decreases as hosts are added for a given total cache size.

For example, if you need to store 300MB of object data on a 6-server farm, the distributed cache will need to store 600MB total if one replica is used per object, and each host then needs 100MB of available RAM for stored objects.

We strongly recommend that you provision servers with at least 50% more memory than the above calculated minimum requirement to handle the "extra" memory needs listed above. In addition, the amount of extra memory you need to handle host failures depends on the number of failures that the distributed cache must handle and the number of servers in the farm. The amount of required extra capacity decreases as you add servers to the farm. If you have four hosts, the surviving three hosts together have to handle an extra 25% of the total workload (about an extra 8.3% per host) after the first host failure. After a second failure, the two surviving hosts would each have to handle 50% of the total workload, twice their original 25% share.
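
The sizing arithmetic can be sketched in a few lines of Python. The sketch follows the worked example above (one primary copy plus R replicas of each object, spread evenly across the hosts) and applies the recommended 50% headroom; the function names are illustrative only.

```python
def per_host_memory_mb(data_mb, replicas, hosts, headroom=0.5):
    """Memory to provision per host: one primary copy plus `replicas` copies,
    spread over `hosts`, plus fractional headroom for the extra memory needs."""
    stored_per_host = data_mb * (replicas + 1) / hosts
    return stored_per_host * (1 + headroom)

def extra_share_per_survivor(hosts):
    """Extra fraction of the total workload each survivor absorbs after one failure."""
    return (1 / hosts) / (hosts - 1)

# 300 MB of object data, one replica per object, six hosts.
baseline = per_host_memory_mb(300, replicas=1, hosts=6, headroom=0.0)  # stored objects only
provisioned = per_host_memory_mb(300, replicas=1, hosts=6)             # with 50% headroom

# With four hosts, each survivor picks up roughly 8.3% of the total workload.
extra = extra_share_per_survivor(4)
```

The per-object overhead and the fixed overhead of the service are not modeled here, so treat the result as a lower bound on what to provision.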

The basic memory overhead of the ScaleOut StateServer service itself is about 2MB for each host running the SOSS service. Also, a small amount of additional memory is used per-object.

You also can elect to run SOSS in a fixed amount of memory per host by setting the configuration options to trigger memory reclamation when a specific memory limit is reached. This usage model must match the requirements of your application so that objects are not unexpectedly removed from the distributed cache. Even in this usage model, extra memory needs to be provisioned to handle load-balancing and failover.


DEP-2. Make sure that the CPU is not overloaded.

The SOSS service typically uses no more than 5-8% of the CPU under moderate load. If the CPU becomes overloaded, SOSS may be unable to maintain service. Other SOSS hosts may detect this problem by missing heartbeat messages from the overloaded server and may then remove the overloaded server from the distributed cache.

Also, it is important to provision extra CPU capacity to handle host failures. To ensure the continued operation of SOSS when a host failure occurs, the remaining hosts in the farm need to be able to take up the load. For example, if each host in a three server farm has a CPU utilization of 90% and one host fails, the other two will likely become overloaded. However, if each server's CPU were being utilized at 30%, then the remaining two should each be able to absorb the additional 15% load and still maintain service.
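
The same headroom check can be written as a short Python sketch. It assumes the failed host's load is redistributed evenly across the survivors, which is a simplification; the function name is illustrative.

```python
def utilization_after_failures(per_host_utilization, hosts, failed=1):
    """Per-host CPU utilization once the load of `failed` hosts is spread
    evenly across the surviving hosts."""
    survivors = hosts - failed
    if survivors <= 0:
        raise ValueError("no surviving hosts")
    return per_host_utilization * hosts / survivors

# Three hosts at 30% each: survivors rise to about 45% and keep running.
light = utilization_after_failures(0.30, hosts=3)

# Three hosts at 90% each: survivors would need more than 100% and overload.
heavy = utilization_after_failures(0.90, hosts=3)
```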


DEP-3. Use Gigabit Ethernet (or faster) to prevent network overload.

Because of the need to replicate objects during updates and access remotely stored objects, SOSS generates enough network traffic to warrant the use of gigabit Ethernet to interconnect SOSS hosts. If you are using virtual servers, a fast network is especially important. For large computational grids with high access rates, an InfiniBand network will avoid network saturation and allow SOSS to maintain linear scalability.

SOSS typically uses less than 10% of network bandwidth. Any sustained network usage that is over 40% should be investigated. In practice, networks probably can only sustain about 50% of their "rated" bandwidth. For example, a 100 Mbps network probably can sustain only about 50 Mbps.

You can estimate the maximum required network bandwidth, B bytes/sec, for an SOSS distributed cache on the network that interconnects the SOSS hosts as follows:

B = (reads/sec * object size) + (updates/sec * object size * (R+1))

where R is the number of replicas per object (which can be set to 1 or 2). Additional network bandwidth may be required to connect remote clients to the distributed cache.

For example, a distributed cache with 125 reads/sec and 125 updates/sec for 100KB objects and one replica per object (R = 1) would use about 37.5 Mbytes/sec, or about 375 Mbits/sec allowing for data encoding overhead. (This is beyond the theoretical saturation point for Fast Ethernet.)

Note that this is a worst case bandwidth estimate based on a pattern of repeated read/update pairs. SOSS's internal client and server caches optimize network usage by eliminating repeated reads. However, SOSS also uses a small amount of network bandwidth for cache validation checks, point-to-point heartbeats, and multicast discovery.
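
The bandwidth estimate is easy to reproduce in a small Python sketch. It uses decimal units (100KB = 100,000 bytes) and assumes 10 bits on the wire per byte as the allowance for encoding overhead, which is what turns 37.5 Mbytes/sec into 375 Mbits/sec; both choices are assumptions made to match the example, and the function names are illustrative.

```python
def required_bandwidth_bytes(reads_per_sec, updates_per_sec, object_bytes, replicas):
    """Worst-case B = reads*size + updates*size*(R+1), in bytes per second."""
    return (reads_per_sec * object_bytes
            + updates_per_sec * object_bytes * (replicas + 1))

def wire_megabits(bytes_per_sec, bits_per_byte=10):
    """Convert to Mbit/s; 10 bits per byte is an assumed encoding allowance."""
    return bytes_per_sec * bits_per_byte / 1_000_000

# 125 reads/sec and 125 updates/sec of 100KB objects with one replica.
b = required_bandwidth_bytes(125, 125, 100_000, replicas=1)
mbits = wire_megabits(b)

fits_fast_ethernet = mbits <= 100      # rated 100 Mbit/s
gigabit_utilization = mbits / 1000     # fraction of a rated 1000 Mbit/s link
```

On gigabit Ethernet this workload stays just under the 40% level at which sustained usage should be investigated, so there is little room for growth.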


DEP-4. Use a separate back-end subnet for best security.

If your "front-end" network subnet is connected to the Internet (for example, in a Web farm), you should consider configuring ScaleOut StateServer for use on a separate, secure network subnet, typically a firewalled, "back-end" network used to connect your servers to an internal database server. This will enhance security for the data in the SOSS cache.


DEP-5. Use a single network switch for SOSS when possible.

You should connect all servers to the same network switch when this is feasible and matches your bandwidth needs. If SOSS servers are connected across two or more switches and a connection between the switches fails, the distributed cache will partition itself into two separate caches. To maintain data integrity after this "split brain" situation is corrected, SOSS usually has to restart the hosts connected to one of the two switches; this can lead to the loss of the latest updates to the distributed cache.

For the same reason, you also should avoid splitting a single SOSS distributed cache across two physical sites, such as two data centers in different cities. In addition to the risk of the above "split brain" issue should a link between sites fail, a single WAN link usually causes performance problems due to its relatively low bandwidth and high latency in comparison to a LAN subnet. Instead, consider using ScaleOut GeoServer® to couple two or more distributed caches and replicate updates across WAN links. This product was designed specifically to solve the problem of maintaining cached data that is shared across two or more sites.


DEP-6. Use SOSS Remote Client to handle large numbers of clients.

If you have a large number of Web or application servers (10 or more) accessing the same distributed cache, consider using the SOSS Remote Client option to create a separate caching tier accessed by the Web and/or application server (or grid server) tiers. This approach lets you optimally configure caching servers that are dedicated to this task. For example, you could use 64-bit, multi-core caching servers provisioned with a large amount of memory and a dedicated back-end gigabit or InfiniBand network. The use of a dedicated caching farm also allows you to reduce disruptions to the caching servers when you reboot Web servers for maintenance purposes.


DEP-7. Use banks of Web farms instead of a large monolithic farm.

To maintain flexibility of operations and higher overall availability for Web applications, consider using a topology that includes multiple banks of server farms rather than a single large farm. In this configuration, an IP load-balancer would maintain affinity of Web or application clients to each bank, and each bank would run a separate SOSS distributed cache. This topology allows an entire bank to be taken down for maintenance or taken offline when traffic is slow. It also avoids a networking bottleneck and improves availability in case of networking failures which would otherwise result in an outage of the entire farm.


DEP-8. Use a VPN for GeoServer and Internet Remote Clients for best security.

For high performance, SOSS does not encrypt communications between SOSS hosts and clients. You should consider using a virtual private network (VPN) to connect SOSS stores using the ScaleOut GeoServer option. This provides secure communications between sites and allows you to use gateway addresses that are only routable across the VPN.

When using the ScaleOut Remote Client Support option, you also should use a virtual private network (VPN) to connect remote clients to an SOSS store if the remote clients access the SOSS store over a public network. This provides secure communications from remote clients located outside the SOSS store's data center and allows you to use gateway addresses that are only routable across the VPN.


DEP-9. If you use NLB, make sure SOSS is on a separate NIC.

ScaleOut StateServer is designed to work seamlessly with load balancers such as Network Load Balancing (NLB) in the Windows Server operating system. However, you should not use ScaleOut StateServer on the same network interface that has NLB installed and enabled. NLB filters incoming multicast network traffic, and this blocks ScaleOut StateServer's multicast management messages. Instead, be sure to configure ScaleOut StateServer to use a different network interface.


DEP-10. Guidelines for sharing sessions in an ASP.NET server farm.

After installing ScaleOut StateServer to store ASP.NET session objects, the following additional steps must be taken to ensure that session objects are visible to all servers in a server farm. These steps are required whenever any "out of process" session provider (e.g., a database server or a distributed cache) is used to store ASP.NET session objects; they enable ASP.NET to use a common mechanism for identifying session objects across the server farm.

  1. In your .NET configuration, confirm that the <machineKey> setting is identical on all of the servers, and make sure that it is not set to "AutoGenerate". Be sure to use an explicitly defined validationKey and decryptionKey. There's a convenient online tool that you can use to generate an explicit key at ASP.NET Resources.
  2. Confirm that the path to the application in IIS is exactly the same on all of your servers. Folder names must be capitalized identically, too.
  3. Is the Web application hosted somewhere other than the default Web site in IIS? If so, applications that run out of other sites will need to have their application paths synchronized in the IIS metabase. You can review the Microsoft knowledge base article (325056) that discusses how to address this issue.
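
The first check in the list above can be automated. The following Python sketch compares the validationKey and decryptionKey attributes of the machineKey element across several configuration files and rejects auto-generated keys; the server names and key values are placeholders, and in practice each XML string would be read from a server's own configuration file.

```python
import xml.etree.ElementTree as ET

def machine_key(config_xml):
    """Return (validationKey, decryptionKey) from a configuration document."""
    root = ET.fromstring(config_xml)
    element = next(root.iter("machineKey"), None)
    if element is None:
        return None
    return element.get("validationKey"), element.get("decryptionKey")

def keys_are_shared(configs):
    """True when every server has the same explicit (not auto-generated) keys."""
    keys = [machine_key(xml) for xml in configs.values()]
    if any(k is None or None in k for k in keys):
        return False
    if any("autogenerate" in part.lower() for k in keys for part in k):
        return False
    return len(set(keys)) == 1

TEMPLATE = """<configuration><system.web>
  <machineKey validationKey="{v}" decryptionKey="{d}" />
</system.web></configuration>"""

# Placeholder key values; real keys are long hexadecimal strings.
farm = {
    "web1": TEMPLATE.format(v="PLACEHOLDER_VALIDATION_KEY", d="PLACEHOLDER_DECRYPTION_KEY"),
    "web2": TEMPLATE.format(v="PLACEHOLDER_VALIDATION_KEY", d="PLACEHOLDER_DECRYPTION_KEY"),
}

# A farm where one server was left on auto-generated keys.
drifted = dict(farm, web2=TEMPLATE.format(v="AutoGenerate,IsolateApps",
                                          d="AutoGenerate,IsolateApps"))
```

Calling keys_are_shared(farm) passes for the matching pair and keys_are_shared(drifted) fails, which is the condition that would make sessions unreadable on some servers.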

DEP-11. Run the correct version of SOSS on 64-bit systems.

It is possible to install and run the 32-bit version of SOSS on a Windows x64 operating system. However, if you are running the 64-bit version of Microsoft's IIS Web server on a Windows x64 operating system, be sure to install the 64-bit version of SOSS instead of the 32-bit version. Otherwise you may get the following exception: "Could not load file or assembly 'soss_svcdotnet' or one of its dependencies. An attempt was made to load a program with an incorrect format."

If you need to run IIS in 32-bit compatibility mode on a 64-bit system, you can obtain SOSS 32-bit client libraries that are compatible with the 64-bit version of SOSS. Please visit www.scaleoutsoftware.com to download these libraries.


DEP-12. Be sure to enable event handling's configuration parameter.

If your application needs to catch asynchronous events, such as object timeouts, be sure to enable event handling on every SOSS host by setting the max_event_tries configuration parameter to a non-zero value; a typical value would be 3. Otherwise, the SOSS host will not deliver events to your application. By default, this parameter is set to zero to avoid unnecessary client/server communication when event handling is not used. For more information, please see the Configuration Parameters section of the SOSS help file.


DEP-13. Be sure to configure host gateways when using the ScaleOut Remote Client Option.

When using the ScaleOut Remote Client option, it is important to properly configure the Gateway Information on every SOSS host. By default, these gateways are set to the IP addresses used by each SOSS host to communicate with its peers on the selected network interface. If remote clients need to access SOSS hosts using a different network subnet, failing to configure the SOSS host gateways will cause loss of connectivity to the SOSS distributed cache.

For example, assume two SOSS hosts communicate with each other on a back-end 10.0.1.x network and that they have the IP addresses 10.0.1.1 and 10.0.1.2 respectively. If remote clients access the SOSS hosts using the 10.0.1.x subnet, no gateway reconfiguration is needed. However, if these SOSS hosts are also connected to a separate front-end network, for example, 192.168.1.x using IP addresses 192.168.1.1 and 192.168.1.2 respectively, and remote clients access the SOSS hosts using these IP addresses, then the SOSS gateways must be configured with the front-end IP addresses instead of the default back-end addresses.

In the above example, the remote clients' configuration files also would be populated with the 192.168.1.x IP addresses so that the clients can find the SOSS hosts. After initially connecting to an SOSS host, the client libraries automatically download the host gateways for client access to the distributed cache. If the gateways are configured with back-end IP addresses not reachable from the remote clients, the clients will then lose connectivity to the SOSS hosts.
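
A quick sanity check for this example can be sketched in Python: list any configured gateway addresses that fall outside the subnet the remote clients use. The sketch treats 192.168.1.x as a /24 subnet, which is an assumption, and the function name is illustrative.

```python
import ipaddress

def unreachable_gateways(gateways, client_subnet):
    """Return gateway addresses that fall outside the subnet remote clients use."""
    network = ipaddress.ip_network(client_subnet)
    return [g for g in gateways if ipaddress.ip_address(g) not in network]

client_subnet = "192.168.1.0/24"                      # front-end network (assumed /24)
default_gateways = ["10.0.1.1", "10.0.1.2"]           # back-end defaults
front_end_gateways = ["192.168.1.1", "192.168.1.2"]   # after reconfiguration

bad = unreachable_gateways(default_gateways, client_subnet)
ok = unreachable_gateways(front_end_gateways, client_subnet)
```

This only checks subnet membership; routing or firewall rules between the clients and the hosts still need to be verified separately.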

Please see the section Configuring the Remote Client Option in the SOSS help file for full details on configuring remote clients.


DEP-14. For single-host deployments, use the loopback adapter if a network is not available or to keep hosts from discovering each other.

ScaleOut StateServer requires a network connection and must be bound to an IP address or subnet for its normal operations. This enables the caching service to detect if a NIC has failed or if the network switch has lost power. If you are running a single-host SOSS cache for evaluation purposes, you can use the Microsoft loopback adapter to create a virtual network environment on standalone servers or laptops that do not have network connectivity. The loopback adapter can also be used to prevent SOSS hosts running on development machines from discovering each other.

  1. Install the Microsoft loopback adapter.
  2. Assign a static IP address to the new connection using Windows' TCP/IP properties dialog.
  3. Use the SOSS Console application to configure the ScaleOut StateServer service to use the new network interface.

MGT-1. Track memory usage.

To maintain high performance and stability, it is important that the SOSS service always runs in physical memory. Otherwise, the service will start paging to disk, which will cause the distributed cache to become unresponsive and to display the yellow "not ready" icon in the SOSS console for one or more hosts (i.e., servers). On Windows, you also can track the amount of available physical memory and the "page fault delta" for the SOSS service process (soss_svr.exe) using the Windows Task Manager; on Red Hat Linux, you can use the System Monitor.

If the amount of available physical memory falls below about 200 MB or if the page fault delta remains non-zero for more than a second or two at a time, the SOSS service has insufficient physical memory. Corrective action needs to be taken immediately. You should provision more physical memory, and/or you can add more hosts to the distributed cache (so that the memory load is spread across more servers). Please see the above section on provisioning memory for an SOSS cache.
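
These two thresholds can be turned into a simple alerting rule. The Python below is a sketch that works on readings you have already collected; how the available-memory and page-fault-delta values are gathered is left out, and the function name, tuple layout, and sample readings are made up for illustration.

```python
def memory_alerts(samples, min_available_mb=200, max_fault_seconds=2):
    """Flag low physical memory or sustained page faulting.

    samples: (seconds, available_mb, page_fault_delta) readings in time order.
    """
    alerts = []
    fault_started = None
    for seconds, available_mb, fault_delta in samples:
        if available_mb < min_available_mb:
            alerts.append((seconds, "low available memory"))
        if fault_delta > 0:
            if fault_started is None:
                fault_started = seconds
            if seconds - fault_started > max_fault_seconds:
                alerts.append((seconds, "sustained page faults"))
        else:
            fault_started = None  # faulting stopped, so reset the timer
    return alerts

readings = [
    (0, 900, 0), (1, 850, 12), (2, 820, 0),              # brief spike, ignored
    (3, 400, 30), (4, 300, 45), (5, 250, 50), (6, 180, 60),
]
alerts = memory_alerts(readings)
```

With these sample readings, both conditions trip at the six-second mark, when available memory drops below 200 MB after three seconds of continuous page faulting.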

Note that SOSS currently does not return dynamically allocated memory back to the operating system after a period of peak usage, based on the assumption that this memory will be needed again. (This will be offered in a future release.) You can control SOSS's peak memory usage and eviction policy using SOSS's configuration parameters. This will keep the service's memory usage constrained to a desired limit.


MGT-2. Use sequential shut down process.

Never shut down a server running the SOSS service or restart the service unless you have first issued the SOSS Leave command (or the "soss leave" command in a console window) and waited until this command has fully completed. Full completion is signaled by the SOSS console when the corresponding host's icon turns to a red circle (with an embedded square) and the host's status is marked as inactive. Allow up to two minutes for a Leave command to complete on a heavily loaded SOSS store. If you are shutting down multiple servers, you should allow enough time for each server to fully complete its leave operation before shutting down the next one unless you run the "Leave all hosts" command to stop the entire SOSS store.

If an active SOSS service process is stopped prematurely, SOSS will invoke its recovery mechanisms, which affect cache performance and unnecessarily disrupt normal operations. Using the "Leave" command allows SOSS to synchronize the load rebalancing and membership change with ongoing cache accesses. This ensures smooth rebalancing of the distributed cache with minimum impact on performance.
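
If you script shutdowns, the one-host-at-a-time ordering can be expressed as a small Python loop. This is only a sketch: issue_leave, is_inactive, and shut_down are hypothetical callables that you would implement with your own tooling (for example, by wrapping the "soss leave" command and a status check), and the two-minute timeout mirrors the allowance mentioned above.

```python
import time

def sequential_shutdown(hosts, issue_leave, is_inactive, shut_down,
                        timeout_s=120, poll_s=1.0, sleep=time.sleep):
    """Leave and shut down hosts one at a time. Stops at the first host that
    does not become inactive within timeout_s and returns the hosts shut down."""
    done = []
    for host in hosts:
        issue_leave(host)
        waited = 0.0
        while not is_inactive(host):
            if waited >= timeout_s:
                return done  # leave this host running and investigate
            sleep(poll_s)
            waited += poll_s
        shut_down(host)
        done.append(host)
    return done

# Demonstration with in-memory fakes in place of real management commands.
polls_until_inactive = {"hostA": 2, "hostB": 0}
log = []

def fake_leave(host):
    log.append(("leave", host))

def fake_inactive(host):
    if polls_until_inactive[host] > 0:
        polls_until_inactive[host] -= 1
        return False
    return True

def fake_shutdown(host):
    log.append(("shutdown", host))

result = sequential_shutdown(["hostA", "hostB"], fake_leave, fake_inactive,
                             fake_shutdown, sleep=lambda seconds: None)
```

The loop never issues a leave for the next host until the previous one has been confirmed inactive and shut down, which is the ordering this guideline asks for.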


MGT-3. Avoid SOSS restart as first step in recovery.

If you encounter a problem with an SOSS server, such as a persistent yellow icon in the SOSS console indicating a host that is not ready, do not restart all SOSS hosts as the first step in remedying the problem. Under normal operations, the SOSS console temporarily displays a yellow icon if network congestion and/or memory paging occur when the distributed store is adding replicas or rebalancing the load. If you are using the SOSS console, make sure to highlight the "Local Store" icon in the left-hand tree list. This ensures that the host and store status refreshes every few seconds and that you are seeing the latest indications.

If SOSS detects a server or network outage, the distributed cache automatically recovers in most cases, and this recovery usually completes within several seconds (up to a minute in some circumstances). If a problem persists, first wait several seconds to see if the SOSS console's "not ready" condition clears on its own. If necessary, next try killing the service process (soss_svr.exe) only on the suspect server and give SOSS several seconds to self-heal. If this fails to resolve the problem and another SOSS host persists in a "not ready" condition, try killing the service process for that host and let the distributed cache self-heal. If the problem persists, it may be necessary to restart the SOSS service process on all hosts; this is rarely necessary.

In no case do you need to reboot the operating system on an affected host; it is always sufficient to kill and restart the SOSS service process. (Note that the SOSS console's "Restart" command may not be sufficient to kill and restart an SOSS host because it relies on the host to first successfully leave the store before a service restart is attempted.) On Windows, you can kill the service process with the Task Manager and restart it with the "net start soss" command or by using the Windows SCM. On Red Hat Linux, you can use the Services Configuration Tool to kill and restart the sossd daemon. After restarting and rejoining the SOSS service, you may have to restart the local host's client application. You can use the iisreset command on Windows to restart the IIS Web server.


MGT-4. Avoid simultaneous management changes on multiple machines.

You should not simultaneously make changes to configuration parameters by running the management tools on different hosts. By doing so, it is possible that multiple hosts could record inconsistent parameter values. Changes to ScaleOut StateServer's configuration should only be made using a single management tool at a time.


MGT-5. Avoid dynamically moving virtual servers running SOSS.

SOSS supports the use of virtual servers, and many customers make wide use of them with SOSS. However, SOSS's mechanisms to maintain its membership may be disrupted if a virtual server is dynamically moved to another physical server, for example, by using VMware's VMotion utility. SOSS uses point-to-point heartbeat messages to detect membership changes due to server or networking outages, and live virtual server migration can introduce delays that SOSS detects as outages. This will trigger recovery actions and eventually could lead to data loss.


MGT-6. Use a rolling upgrade when upgrading minor and hot fix releases.

Starting with version 3.0, SOSS ensures that minor and hot fix releases are backwards compatible with previous minor releases within the same major release. For example, version 3.1.6 is backwards compatible with version 3.1.5. This lets you maintain service to applications while upgrading individual SOSS hosts one at a time.

To perform a rolling upgrade to the next SOSS version, take the following actions for each host in turn, one host at a time. Make sure that you fully complete all steps and then verify that the distributed cache is running normally before upgrading the next host:

  1. Open the SOSS management console on the host, select the local host from the list on the left, and go to the "Host Status" tab.
  2. Click "Leave" to cause the host to leave the distributed cache.
  3. Wait for the host to completely finish leaving. Its icon will change from green with a triangle, through yellow with a minus, to red with a square.
  4. Uninstall SOSS on this host.
  5. Install the new version of SOSS on this host.
  6. Open the SOSS management console and configure the host as necessary. Note that the host should automatically update its license key entry from other SOSS hosts when the SOSS service process restarts and detects an SOSS distributed cache.
  7. In the SOSS console, go to the "Host Status" tab and click the "Join" command to join the distributed cache.
  8. Wait for the host to finish joining. Its icon will change from red with a square, through yellow with a plus, to green with a triangle.
  9. Verify that load-balancing has completed and that the distributed cache is running normally before upgrading the next host.
   

©ScaleOut Software Inc, 2003-2007. All rights reserved. ScaleOut StateServer and ScaleOut GeoServer are trademarks of ScaleOut Software, Inc.