Recovering Momentum 4

Last Updated: Jan 29, 2016 02:33PM EST



Shutdown/Startup Momentum 4

If you need to reboot a server, keep the following items in mind before doing so.

  • Maintain the quorum. Clustered databases are picky about quorum. If the majority of servers are not up, Vertica will automatically shut down and manual intervention will be needed to bring it back up. If multiple servers need to be rebooted, reboot one server at a time and verify that the server is up before rebooting the next one. For more information on quorums, see the Momentum 4 Reference Guide Glossary. A quick way to check Vertica node states before a reboot is shown after this list.
  • Shut down databases gracefully. If at all possible, shut down the databases gracefully prior to re-booting the server.
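
One quick way to confirm that the Vertica nodes are all up before taking one down is Vertica's admintools utility. The following is a sketch that assumes the default Vertica install path; run it as the database administrator user.

# Show the state of the Vertica cluster; all nodes should report UP
# before you take one of them down.
/opt/vertica/bin/admintools -t view_cluster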

Controlled Shutdown of Momentum 4

To shut down all processes associated with the Momentum application suite, issue the following commands in this order:

/etc/init.d/msys-nginx stop
/etc/init.d/ecelerity stop (Note: If the node will be down for a while, allow the rabbitmq queues to drain before proceeding)
/etc/init.d/msys-app-webhooks-api stop
/etc/init.d/msys-app-users-api stop
/etc/init.d/msys-app-metrics-api stop
/etc/init.d/msys-app-adaptive-delivery-api stop
/opt/msys/3rdParty/cassandra/bin/nodetool drain
/etc/init.d/msys-cassandra stop
/etc/init.d/msys-vertica stop
/etc/init.d/msys-riak stop
/etc/init.d/msys-app-webhooks-transmitter stop
/etc/init.d/msys-app-webhooks-batch stop
/etc/init.d/msys-app-metrics-etl stop
/etc/init.d/msys-app-adaptive-delivery-etl stop
/etc/init.d/msys-rabbitmq stop
/etc/init.d/eccmgr stop (Note: Exists only on the first platform/manager/log aggregator node)
/etc/init.d/ecconfigd stop (Note: Exists only on the first platform/manager/log aggregator node)
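
If you shut down nodes regularly, the same sequence can be wrapped in a small script. The sketch below is not part of the product; it preserves the order above, skips services that are not installed on the node (eccmgr and ecconfigd exist only on the first platform/manager node), and stops at the first failure so you can investigate before continuing. Allow the RabbitMQ queues to drain first if the node will be down for a while.

#!/bin/bash
# Sketch: run the controlled shutdown in the documented order and stop at the
# first failure. Services that are not installed on this node are skipped.
stop_svc() {
  [ -x /etc/init.d/$1 ] || return 0
  /etc/init.d/$1 stop || { echo "Failed to stop $1" >&2; exit 1; }
}
for svc in msys-nginx ecelerity msys-app-webhooks-api msys-app-users-api \
           msys-app-metrics-api msys-app-adaptive-delivery-api; do
  stop_svc $svc
done
# Flush Cassandra to disk before stopping it.
/opt/msys/3rdParty/cassandra/bin/nodetool drain || exit 1
for svc in msys-cassandra msys-vertica msys-riak msys-app-webhooks-transmitter \
           msys-app-webhooks-batch msys-app-metrics-etl \
           msys-app-adaptive-delivery-etl msys-rabbitmq eccmgr ecconfigd; do
  stop_svc $svc
done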

Controlled Startup of Momentum 4

To start all processes associated with the Momentum application suite, issue the following commands in this order:
/etc/init.d/ecconfigd start (Note: Exists only on the first platform/manager/log aggregator node)
/etc/init.d/eccmgr start  (Note: Exists only on the first platform/manager/log aggregator node)
/etc/init.d/msys-rabbitmq start
/etc/init.d/msys-app-metrics-etl start
/etc/init.d/msys-app-webhooks-transmitter start
/etc/init.d/msys-app-adaptive-delivery-etl start
/etc/init.d/msys-riak start
/etc/init.d/msys-vertica start
/etc/init.d/msys-cassandra start
/etc/init.d/msys-app-adaptive-delivery-api start
/etc/init.d/msys-app-metrics-api start
/etc/init.d/msys-app-users-api start
/etc/init.d/msys-app-webhooks-api start
/etc/init.d/ecelerity start (Note: All services started so far should be up prior to starting ecelerity)
/etc/init.d/msys-nginx start
/etc/init.d/msys-app-webhooks-batch start
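
Once everything has been started, a quick status sweep can confirm that each service is running. This is a sketch that assumes each init script supports the conventional status action; services not installed on the node are skipped.

# Sketch: check the status of every Momentum service installed on this node.
for svc in ecconfigd eccmgr msys-rabbitmq msys-app-metrics-etl \
           msys-app-webhooks-transmitter msys-app-adaptive-delivery-etl \
           msys-riak msys-vertica msys-cassandra msys-app-adaptive-delivery-api \
           msys-app-metrics-api msys-app-users-api msys-app-webhooks-api \
           ecelerity msys-nginx msys-app-webhooks-batch; do
  [ -x /etc/init.d/$svc ] && /etc/init.d/$svc status
done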

Maintenance Steps After Momentum or a Server Goes Down

Node Down

If Momentum (or the entire node) goes down, the system should recover on its own, running with only two nodes. All IPs/bindings that were bound to the down server will automatically migrate to the remaining servers via DuraVIP functionality. After the IPs have migrated to the remaining servers, you must run the ARPing maintenance command to establish solid connectivity between the newly bound IP addresses and the remote gateway. The ARPing command is shown below:

On analytics nodes:
for i in $(ifconfig | grep "inet addr" | awk -F: '{print $2}' | awk '{print $1}' | grep -v \
  "127.0.0.1" | sort); do printf "IP: %s:  " $i; arping -s $i -I bond0 -c 5 abc.def.lmn.xyz; \
  done

On platform nodes:
for i in $(ifconfig | grep "inet addr" | awk -F: '{print $2}' | awk '{print $1}' | grep -v \
  "127.0.0.1" | sort); do printf "IP: %s:  " $i; arping -s $i -I bond0 -c 5 abc.def.lmn.xyz; \
  done

To verify full connectivity of all IPs bound on a node, issue the following command:
for i in $(ifconfig | grep "inet addr" | awk -F: '{print $2}' | awk '{print $1}' | grep -v \
  "127.0.0.1" | sort); do printf "IP: %s:  " $i; nc -z -w5 -s $i emailserverdomain 25; done
If full connectivity is present on the remaining nodes, the system should deliver messages as normal.
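
The two checks can also be combined into one pass that reports only failures. The sketch below uses the same commands as above; replace the placeholder gateway address and mail host with your own values.

# Sketch: for every non-loopback IP bound on this node, verify ARP reachability
# of the gateway and outbound connectivity on port 25, reporting any failures.
GATEWAY=abc.def.lmn.xyz        # placeholder: your remote gateway
SMTP_HOST=emailserverdomain    # placeholder: a reachable mail host
for i in $(ifconfig | grep "inet addr" | awk -F: '{print $2}' | awk '{print $1}' \
    | grep -v "127.0.0.1" | sort); do
  arping -q -s $i -I bond0 -c 5 $GATEWAY || echo "ARP check failed for $i"
  nc -z -w5 -s $i $SMTP_HOST 25 || echo "Port 25 check failed for $i"
done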

Node Back Up

Once the downed node comes back up, the ARPing script should be run on all nodes. After the ARPing command has run successfully on all nodes, verify full connectivity of all IPs bound to the nodes with the connectivity command shown above.

After connectivity has been verified on all nodes of the cluster, restart Momentum on all nodes that stayed up during the event to ensure that all queues successfully send any messages that were queued during the event. Issue the following command to restart Momentum:
/etc/init.d/ecelerity restart

Restarting a Process

When restarting a process, all processes that connect to it must also be restarted. For example, when restarting RabbitMQ, you must also restart the following (a sketch follows the list):

  • msys-app-metrics-etl
  • msys-app-adaptive-delivery-etl
  • msys-app-webhooks-batch
  • msys-app-webhooks-transmitter
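
For example, a RabbitMQ restart could be scripted as follows. This is a sketch: it stops the dependent processes, restarts RabbitMQ, and starts the dependents again, mirroring the stop/start order from the controlled shutdown and startup sections above.

# Sketch: restart RabbitMQ together with the processes that connect to it.
for svc in msys-app-webhooks-transmitter msys-app-webhooks-batch \
           msys-app-metrics-etl msys-app-adaptive-delivery-etl; do
  /etc/init.d/$svc stop
done
/etc/init.d/msys-rabbitmq restart
for svc in msys-app-metrics-etl msys-app-webhooks-transmitter \
           msys-app-adaptive-delivery-etl msys-app-webhooks-batch; do
  /etc/init.d/$svc start
done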

Restarting Cassandra

Properly restarting Cassandra is a multi-step process. Note that Cassandra resides on Platform nodes.

If at any time in this process you encounter an error, contact Message Systems Support. It is an anomaly when Cassandra cannot restart.

1. Ensure the C* cluster is healthy:
[plat1:~]$ /opt/msys/3rdParty/cassandra/bin/nodetool status

You should get output similar to:
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns   Host ID                               Rack
UN  10.10.10.001   19.06 MB   256     33.9%  943cb61b-6839-42c2-ba1d-c7ba3f046f11  rack1
UN  10.10.10.002   32.35 MB   256     32.6%  45cb5eb7-e28c-4b54-92a8-9a68945fbb31  rack1
UN  10.10.10.003   32.32 MB   256     33.5%  b0918fcf-799f-4288-9764-fbd436e8b785  rack1
Note that node health is a two-letter readout. A healthy node reads UN (up and normal).
If the cluster is not healthy, contact Message Systems Support. If you are not sure, contact Message Systems Support.

2. Drain all data in the Cassandra instance to disk:
[plat1:~]$ /opt/msys/3rdParty/cassandra/bin/nodetool drain
This command produces no output but should exit with status 0. Note that drain flushes memtables to disk, which could trigger a compaction that gets interrupted during the shutdown step. This is fine, since the compaction likely started as a result of this flush and will complete after the restart.

3. Restart/stop Cassandra:
[plat1:~]$ sudo /etc/init.d/msys-cassandra restart
Stopping cassandra: .                                      [  OK  ]
Starting cassandra: ..............                         [  OK  ]
 
The start-up script waits for the Cassandra process to begin listening on port 9160. This typically takes about 45 seconds; however, sometimes it can take longer than that, which makes it appear that Cassandra has failed to start. This usually happens because Cassandra is busy taking care of some housekeeping.

If start-up appears to be failing, tail /var/log/msys-cassandra/system.log to review progress.
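
If you prefer not to watch the log, a small loop can poll the port until Cassandra answers. This is a sketch; the timeout of roughly five minutes is arbitrary and can be adjusted.

# Sketch: wait locally for Cassandra to start listening on port 9160 (up to ~5 minutes).
for n in $(seq 1 60); do
  nc -z -w5 localhost 9160 && { echo "Cassandra is listening on 9160"; break; }
  sleep 5
done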

4. Ensure the restarted Cassandra cluster is healthy:
[plat1:~]$ /opt/msys/3rdParty/cassandra/bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns   Host ID                               Rack
UN  10.10.10.001   174.73 KB  256     13.7%  56d07eb0-1947-4f33-92aa-f145eb7a8563  rack1
UN  10.10.10.002   180.46 KB  256     13.9%  3e055b85-2c50-4a07-b333-95c4a53ae264  rack1
UN  10.10.10.003   179.4 KB   256     14.9%  2f035b50-1b0b-4df5-9012-70b41174363f  rack1
UN  10.10.10.004   175.08 KB  256     14.9%  8926719c-51a6-4329-b51d-256f17b03db3  rack1
UN  10.10.10.005   267.54 KB  256     13.6%  cd1f4350-2833-4b77-897b-2dfde6c2de7e  rack1
UN  10.10.10.006   129.25 KB  256     14.7%  e07938cc-9957-4855-ae62-a12d41d4bfbd  rack1
UN  10.10.10.007   161.23 KB  256     14.3%  5d9094dc-e972-4c2f-a9a9-8ab191fb1444  rack1
It is also a good idea to check the logs:
[plat1:~]$ tail -n10 /var/log/msys-cassandra/daemon.log
 INFO 21:58:32,196 Node /10.10.10.001 has restarted, now UP
 INFO 21:58:32,199 Node /10.10.10.001 state jump to normal
 INFO 21:58:32,199 Handshaking version with /10.10.10.002
 INFO 21:58:32,202 InetAddress /10.10.10.001 is now UP
 INFO 21:58:32,225 InetAddress /10.10.10.003 is now UP
 INFO 21:58:32,226 Node /10.10.10.003 state jump to normal
 INFO 21:58:32,253 Node /10.10.10.004 state jump to normal
 INFO 21:58:32,281 Node /10.10.10.005 state jump to normal
 INFO 21:58:32,307 Node /10.10.10.006 state jump to normal
 INFO 21:58:40,028 No gossip backlog; proceeding
If you see errors you do not understand, contact Message Systems Support.
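
The four steps above can also be collected into a single per-node script. The sketch below is a convenience only; it still requires you to read the nodetool status output yourself, and the 60-second pause before the final check is an arbitrary value.

#!/bin/bash
# Sketch: drain, restart, and re-check Cassandra on one platform node.
NODETOOL=/opt/msys/3rdParty/cassandra/bin/nodetool
$NODETOOL status || exit 1                           # step 1: inspect output for UN on all nodes
$NODETOOL drain || exit 1                            # step 2: flush memtables to disk
sudo /etc/init.d/msys-cassandra restart || exit 1    # step 3: restart Cassandra
sleep 60                                             # give the node time to rejoin the cluster
$NODETOOL status                                     # step 4: verify health after the restart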

Disaster Recovery

Momentum is a highly scalable and customizable message management platform. When deployed in a cluster, it is designed for fault tolerance and high availability, thus eliminating all single points of failure and providing active/active server fail-over. However, various failures can still occur in a production system, including:

  • Entire cluster failure
  • Entire server failure (including storage)
  • Complete storage failure
  • Partial storage failure
  • Server failure (no storage failure)

All storage used with Momentum should be backed by a redundant array of independent disks (RAID) and should be proactively monitored and repaired when necessary. The sections below describe how to recover from a catastrophic failure of Momentum components.

Spool

The spool is designed to be transient storage for messages. It is stored under /var/spool/ecelerity and is constantly updated as the MTA receives and delivers messages. The spool should be backed by RAID storage.

The spool can be corrupted when the underlying backing storage encounters unrecoverable failures. If the spool partition becomes corrupted or encounters issues, it cannot be recovered. Therefore, if an outage occurs, it is nearly impossible to determine which mailings were sent and which were not. In that situation, it is best to consider the mailings that occurred at the time of the outage to be sent, and continue from there.

When the spool is not corrupted, it can be imported into a different installation.

For a new installation (i.e., one that has not sent any mail), copy the old spool into /var/spool/ecelerity. This will be picked up when the node first starts and the messages will be sent.

If an outage occurs on a single node, the simplest solution is to move the spool directory to another machine/node (a sketch of these steps follows the list):

  1. Copy /var/spool/ecelerity and all sub-directories to a new directory on the new machine. NOTE: Do not copy them into the /var/spool/ecelerity directory on the new machine.
  2. Ensure that the directory structure and all sub-directories are owned by ecuser:ecuser.
  3. Run the ec_console command /opt/msys/ecelerity/bin/ec_console.
  4. Run the command spool import <path to directory of copied spool>. This will import the entire listed directory.
  5. After you see no more receptions in the mainlog.ec, run the ec_console command spool import_poll /path/to/copied/spool to verify that the spool import has completed.
  6. Delete the spool you copied from the other machine (the local node will have imported these directories into /var/spool/ecelerity).
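
Put together, the procedure looks roughly like the following. The host name oldnode and the staging path /var/spool/ecelerity-import are examples only; any location outside /var/spool/ecelerity on the new machine will do.

# Sketch: copy the spool from the failed node into a staging directory
# (not into /var/spool/ecelerity) and fix its ownership.
scp -r oldnode:/var/spool/ecelerity /var/spool/ecelerity-import
chown -R ecuser:ecuser /var/spool/ecelerity-import

# Then, from within ec_console, import the copied spool and poll for completion:
/opt/msys/ecelerity/bin/ec_console
  spool import /var/spool/ecelerity-import
  spool import_poll /var/spool/ecelerity-import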

MTA Configuration Files

Message Systems strongly advises that you periodically back up the configuration files in order to prevent serious downtime. Typically, you can copy the configuration files from other nodes in the cluster. However, if the entire cluster is lost, your only solution is to restore from a backup.

Backup

Perform the following backups on the first Momentum node (Manager). You may also want to copy the directory as a precaution, in case the dump or export is corrupted. A combined script sketch follows the numbered steps.

1. Backup the default directory, which includes all scripts, dkim keys, and other entries.
tar cvf /tmp/ecelerity.etc.tar /opt/msys/ecelerity/etc --exclude .svn
2. Backup the RabbitMQ configuration and optional ODBC changes.
tar cvf /tmp/3rdParty.etc.tar /opt/msys/3rdParty/etc
3. Backup the Cassandra configuration.
tar cvf /tmp/cassandra.conf.tar /opt/msys/3rdParty/cassandra/conf
4. Backup the NGINX configuration.
tar cvf /tmp/nginx.conf.tar /opt/msys/3rdParty/nginx/conf
5. Backup the Vertica configuration.
tar cvf /tmp/vertica.conf.tar /opt/vertica/config
6. Backup the lastinstall file and installation logs. (They can be used as a reference when reinstalling.) The lastinstall file is located at /opt/msys/etc/installer/lastinstall. The install log is in the directory used to install Momentum, and has a name in the following format: install20110928T081812.log.
7. Backup the Subversion repository. This is performed as a dump, which maintains the configuration history.
svnadmin dump /opt/msys/ecelerity/etc/conf > /tmp/repo1.dump
8. Backup the application configurations.
tar cvf /tmp/app.configs.tar /opt/msys/app/*/config/production.json
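
The backups above can be collected into one script run from the Manager node. This is a sketch; the dated directory under /tmp is an example layout, and the resulting files should be copied off the node once the script completes.

#!/bin/bash
# Sketch: gather all of the backups listed above into one dated directory.
BACKUP=/tmp/momentum-backup-$(date +%Y%m%d)
mkdir -p $BACKUP
tar cvf $BACKUP/ecelerity.etc.tar /opt/msys/ecelerity/etc --exclude .svn
tar cvf $BACKUP/3rdParty.etc.tar /opt/msys/3rdParty/etc
tar cvf $BACKUP/cassandra.conf.tar /opt/msys/3rdParty/cassandra/conf
tar cvf $BACKUP/nginx.conf.tar /opt/msys/3rdParty/nginx/conf
tar cvf $BACKUP/vertica.conf.tar /opt/vertica/config
cp /opt/msys/etc/installer/lastinstall $BACKUP/
svnadmin dump /opt/msys/ecelerity/etc/conf > $BACKUP/repo1.dump
tar cvf $BACKUP/app.configs.tar /opt/msys/app/*/config/production.json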

Restore

When performing a restore, replace the failed MTA with a new MTA. Reuse the same IP address and host name. To restore a failed Manager node (first Momentum node), follow the steps below:

1. Start with a clean install of the manager in order to create the directory structure and the svn repository.
2. Restore from the backup of directories, which can be done after a clean install on the manager. Alternatively, you can restore a Subversion repository from a dump file by running the following command:
svnadmin load /opt/msys/ecelerity/etc/conf < /tmp/repo1.dump
3. Restore directly into the /opt/msys/ecelerity/etc directory.
4. Run a commit with the --add-all switch specified.
5. Restore Cassandra data from a snapshot. The only supported Cassandra DR scheme is to perform regular backups and restore from the backup. For more information, see the Datastax documentation.
6. Restore Vertica data from a snapshot. The only supported Vertica DR scheme is to perform regular backups and restore from the backup. For more information, see the HP Vertica documentation.
7. Start the services.
 
 