Monday, February 22, 2016

Troubleshooting-Hardware

Hardware/Interface/ARP/Routing/Connections/CPU/RAM


cat /proc/cpuinfo | egrep "MHz|model name"
cat /proc/meminfo | grep MemTotal
/usr/sbin/dmidecode | grep "Product Name"


dmidecode | egrep -i "serial|product"
clish -c "show asset all"
clish -c "show sysenv all"
(same as "show env" on Cisco)

For the CPU details use cat /proc/cpuinfo
For the RAM details use cat /proc/meminfo | grep MemTotal

The output of the dmesg command and the logs under /var/log/ should be examined for hardware errors.
Critical error messages in these logs can also be very helpful.

Example of errors in /var/log/messages:
wd0: interrupt timeout:
wd0: status 58<seekdone,drq> error 0
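
A minimal sketch of how such messages could be filtered out of a log. The function name scan_hw_errors and the pattern list are illustrative assumptions based only on the sample messages above, and the sshd line in the demo is made up.

```shell
# scan_hw_errors: print lines from stdin that look like hardware errors.
# The pattern is an assumption, not a complete list of error strings.
scan_hw_errors() {
    grep -E -i 'interrupt timeout|status [0-9a-f]+<.*> error|I/O error'
}

# Demo: the two sample lines from this post plus one harmless, made-up line.
sample_log='wd0: interrupt timeout:
wd0: status 58<seekdone,drq> error 0
sshd[123]: session opened for user admin'

printf '%s\n' "$sample_log" | scan_hw_errors
```

On a live system the same function could be fed with dmesg | scan_hw_errors or scan_hw_errors < /var/log/messages.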


HOW DO I SHUT AND UNSHUT AN INTERFACE
# ifconfig <interface name> down
# ifconfig <interface name> up

netstat -i
more /etc/sysconfig/hwconf | grep 'eth*' | grep -v detached
(to find the MAC address; the pattern is quoted so the shell does not expand it)
cpstat os -f ifconfig  (interface IP/MAC/MTU/description)
ifconfig -a  (ifconfig eth4)
cphaprob -a if (only for a cluster)
fw ctl iflist   (lists just the interface names and numbers, used for some fw monitor captures)
fw getifs    (summary display of IP addresses per interface)
ethtool -S eth0  (to see errors on interface )
ip add show
clish -c "show interface eth1"
clish -c "show configuration interface"
NOTE: you can monitor for errors on an interface using the watch command
watch -n 1 "netstat -i"

The default refresh interval is 2 seconds; with -n you can specify a different interval
watch -n 0.1 "cphaprob -a if"
watch -n 1 " ethtool -S eth1 | grep errors "
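
When the counter list is long, it can help to show only the counters that are not zero. The following is a rough sketch; the function name nonzero_errors is hypothetical, and the sample input is made up in the shape of ethtool -S output (real counter names vary by driver).

```shell
# nonzero_errors: from "ethtool -S"-style text on stdin, print only counters
# whose name contains err, drop or no_buffer and whose value is not 0.
nonzero_errors() {
    awk -F: '
        tolower($1) ~ /err|drop|no_buffer/ {
            gsub(/[ \t]/, "", $1)   # strip padding around the name
            gsub(/[ \t]/, "", $2)   # strip padding around the value
            if ($2 + 0 != 0) print $1 "=" $2
        }'
}

# Hypothetical sample; not taken from a real appliance.
sample_stats='NIC statistics:
     rx_packets: 1200
     rx_errors: 0
     rx_crc_errors: 7
     rx_no_buffer_count: 3
     tx_dropped: 0'

printf '%s\n' "$sample_stats" | nonzero_errors
```

On a live system: ethtool -S eth1 | nonzero_errors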


Script to check speed and duplex on all interfaces:
for ii in $(ifconfig | awk ' /Ethernet/ {print $1}') ;do ethtool $ii; done | egrep  'eth|Speed|Duplex'

To see the top talkers on an interface (in this case eth1); the last awk keeps only sources with more than 100 of the 20000 captured packets:
tcpdump -tnn -c 20000 -i eth1 | awk -F "." '{print $1"."$2"."$3"."$4}' | sort | uniq -c | sort -nr | awk ' $1 > 100 '
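
A variant of the one-liner above that prints a fixed number of top talkers instead of using a packet threshold. This is only a sketch: the function name top_talkers is hypothetical, it counts IPv4 lines only, and the sample capture lines are made up in the shape of tcpdump -tnn output.

```shell
# top_talkers N: read "tcpdump -tnn"-style lines on stdin and print the
# N most frequent IPv4 source addresses as "count address" (default N=4).
top_talkers() {
    awk '$1 == "IP" {
            n = split($2, p, ".")      # four octets, plus the port if present
            if (n >= 4) print p[1] "." p[2] "." p[3] "." p[4]
         }' |
    sort | uniq -c | sort -k1,1nr | head -n "${1:-4}" |
    awk '{print $1, $2}'
}

# Hypothetical capture lines for the demo.
sample_capture='IP 10.1.1.5.51000 > 10.2.2.2.443: tcp 0
IP 10.1.1.5.51001 > 10.2.2.2.443: tcp 0
IP 10.1.1.5.51002 > 10.2.2.2.443: tcp 0
IP 10.1.1.9.40000 > 10.2.2.2.53: UDP, length 30
IP 10.1.1.9.40001 > 10.2.2.2.53: UDP, length 30
IP 10.1.1.7.1234 > 10.2.2.2.22: tcp 0
ARP, Request who-has 10.1.1.1 tell 10.1.1.5, length 28'

printf '%s\n' "$sample_capture" | top_talkers 2
```

On a live system: tcpdump -tnn -c 20000 -i eth1 | top_talkers 4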

fw monitor -e "bad_nets = static {<194.1.0.0,194.1.255.255>} ;accept src in bad_nets and ifid=0;" (the ifid can be seen with the fw ctl iflist command)
Captures packets originating in the range 194.1.0.0 – 194.1.255.255, and only on the interface with ifid 0 (eth3 in this example)


Test ARP
arping -I eth1 1.1.1.1  (default is eth0)
arping -s <source ip> 10.0.0.1  (in case you have multiple IPs on the interface)

arp -an | wc -l   or   arp -av | grep Entries  (to see the number of ARP entries)
arp -i eth6   (to see all ARP entries on eth6)
cat /proc/net/arp
clish -c "show configuration arp"
clish -c "show arp table cache-size"  (Default: 1024, Range: 1024-16384)
clish -c "show arp static all"
clish -c "show arp dynamic all" | grep 1.1.1.1
clish -c "show arp proxy all"
ip neigh show
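
To judge how close the table is to the configured cache size, the entry count can be compared with the limit. A rough sketch follows; the function name arp_usage is hypothetical, the sample table is made up in the /proc/net/arp layout, and it assumes the kernel table shown in /proc/net/arp is what the cache-size setting limits.

```shell
# arp_usage LIMIT: read /proc/net/arp-format text on stdin and report how
# full the table is relative to LIMIT (default 1024, the default noted above).
arp_usage() {
    awk -v limit="${1:-1024}" '
        NR > 1 && NF > 0 { n++ }      # skip the header line
        END { printf "%d/%d entries (%d%%)\n", n, limit, n * 100 / limit }'
}

# Hypothetical table with three entries.
sample_arp='IP address       HW type     Flags       HW address            Mask     Device
10.0.0.1         0x1         0x2         00:11:22:33:44:55     *        eth0
10.0.0.2         0x1         0x2         00:11:22:33:44:66     *        eth0
10.0.0.3         0x1         0x0         00:00:00:00:00:00     *        eth1'

printf '%s\n' "$sample_arp" | arp_usage 4
```

On a live system: arp_usage 1024 < /proc/net/arp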

Flush arp entry for host 10.20.30.40
arp -d 10.20.30.40

Flush all arp entries on interface eth0:
ip neigh flush dev eth0

Flush arp entry for host 10.20.30.40:
ip neigh flush 10.20.30.40

Flush arp entry for all hosts in network 192.168.0.0/24
ip neigh flush 192.168.0.0/24

Clear the whole arp table
clish
delete arp dynamic all

To check whether any manual proxy arp is configured:
cat $FWDIR/conf/local.arp

To change the arp cache size:
clish
set arp table cache-size VALUE


Troubleshoot Gaia Routing
General Commands:

ip route show to match x.x.x.x
ip route get x.x.x.x
clish -c "show route destination x.x.x.x"
show route (in iclid shell)
netstat -rn | grep x.x.x.x
netstat -rn | wc -l (number of routes)
cpstat os -f routing
clish -c "show configuration ospf"
clish -c "show configuration static-route"


Add or remove a static route (in clish)
set static-route VALUE nexthop blackhole
set static-route VALUE nexthop gateway address VALUE off
set static-route VALUE nexthop gateway address VALUE on
set static-route VALUE nexthop reject
set static-route VALUE off



Restart Routing Process
ps auxw | grep -v grep | grep -E "PID|routed" (to see whether the process is running)
drouter stop && drouter start
drouter start           -       Start Dynamic Routing daemon
drouter stop            -      Stop Dynamic Routing daemon

tellpm process:routed t    (start routing process)    (survives reboot)
tellpm process:routed     (stop routing process)    (survives reboot)


cat /proc/cpuinfo
cpstat -f cpu os
cpstat -f multi_cpu os
cpstat os -f perf
ps auxwf
vmstat 2 5
top

Note: This shows drops due to the CPU not being able to cope
watch -n 1 "ethtool -S eth1 | grep rx_no_buffer_count"


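top explanation

%us: Time spent running non-kernel code (user)
%sy: Time spent running kernel code (system)
%ni: Time spent running niced (priority-adjusted) user processes
%id: Time spent idle
%wa: Time spent waiting for I/O
%hi: Time spent servicing hardware interrupts
%si: Time spent servicing software interrupts
%st: Steal time (involuntary wait while the hypervisor serves another virtual machine)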

The RES (or RSS) column shows high memory consumption by a specific process (for example, fwm).
The output can also be sorted, as follows:
Pressing ‘M’ (Shift+M)
sorts the output based on the memory usage (RSS column)
Pressing ‘P’ (Shift+P)
sorts the output based on the CPU usage (%CPU column)

The idle value (%id) shows how busy the appliance is.
If the value is 0, the CPU is maxed out. With the
firewall under load, examine the idle column (%id) for each CPU (press 1 in top to show every CPU) and determine whether core usage is spread evenly.

High CPU in user time (%us)
indicates that some daemon process is consuming a lot of CPU;
security server processes like fwssd and in.ahttpd have been offenders in the past. (Figure out
which process it is from the output of ps or top.)

High CPU usage in system (%sy)
indicates that the Check Point kernel (traffic being inspected by Check Point or SmartDefense) is consuming CPU. Certain configurations in SmartDefense and Web Intelligence can cause this by disabling SecureXL templating or by completely disabling SecureXL acceleration.

High CPU in wait time (%wa)
occurs when the CPU is idle because the system is waiting for an outstanding disk I/O request to complete. This indicates that your system is probably low on physical memory and is swapping out memory (paging).
The CPU is not actually busy if this number is spiking; it is blocked from doing any useful work while waiting for an I/O event to complete. Paging can be detected by running vmstat -n 5 5 and checking the swapped in (si) and swapped out (so) statistics. Disregard the first line, as it is an average since the appliance started.

A high value for software interrupts (%si) indicates that there is probably a high load of traffic on the appliance. The interface errors (netstat -i) should be examined to see whether this is a cause for concern.



vmstat explanation

How to timestamp vmstat output?
vmstat 1 |awk '{now=strftime("%Y-%m-%d %T "); print now $0}'

Note: The first line is the system average since boot, so it can be ignored

The ‘procs’ field has 3 columns:
r – The number of processes waiting for run time (tasks/threads waiting in line to get the CPU).
For example: task==>task==>CPU1   task==>CPU2   here 3 threads are waiting.
This is a general indicator of how much work is queued for the CPU and how busy it is.
The average CPU load can be tracked with the uptime command, which shows load averages over 1, 5 and 15 minutes; in other words, how many threads are running or waiting to run on the CPU, averaged over those intervals.
b – The number of processes in uninterruptible sleep (blocked processes, usually waiting for I/O).
w – The number of runnable processes that have been moved from RAM to swap (virtual memory) because memory is too busy. Newer vmstat versions show only r and b.
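
To relate the load average to the number of CPUs, the load can be divided by the CPU count. A small sketch follows; the function name load_per_cpu is hypothetical and the demo uses fixed numbers rather than a live system.

```shell
# load_per_cpu LOAD CPUS: print the load average divided by the CPU count.
# A result above 1.00 suggests more runnable threads than CPUs.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Demo with fixed numbers: a load of 3.00 on a 2-CPU machine.
load_per_cpu 3.00 2

# On a live Linux system the inputs could be read like this:
# load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(grep -c '^processor' /proc/cpuinfo)"
```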

The ‘memory’ field has 4 columns: (see with vmstat -a)
swpd – The amount of used swap space(virtual memory)
free – The amount of idle memory(free RAM/Real memory).
inact – The amount of inactive memory.
active – The amount of active memory.
******************************************************
The ‘swap’ field has 2 columns:
si – Amount of memory swapped in from disk (/sec).
so – Amount of memory swapped to disk (/sec).
******************************************************
The ‘io’ field has 2 columns:
bi – Blocks received from a block device (blocks in).
bo – Blocks sent to a block device (blocks out).
******************************************************
The ‘system’ field has 2 columns:
in – The number of interrupts per second, including the clock (System interrupts).
cs – The number of context switches per second (Process context switches).
******************************************************
The ‘cpu’ field has only 4 columns:
us: Time spent running non-kernel code (applications and processes run by users).
sy: Time spent running kernel code (system time, including time spent serving interrupts).
id: Time spent idle.
wa: Time spent waiting for IO.
******************************************************
CPU Problem:
If r constantly shows non-zero numbers, threads/tasks are waiting to be processed by your CPU.
If in is high, you are handling too many interrupts (likely from disk activity, but it could be a bad driver).
Processes Problem:
us or sy is high? Some process is hogging the CPU; use top to find it, and kill -9 the PID if needed.

Disk Subsystem Overloaded:
wa is high? If you are waiting for IO then you need to upgrade your disk subsystem

Not Enough RAM:
si and so are high? The system is swapping to disk too much. You really shouldn’t swap at all for high performance. If these are high, in will be high too. Upgrade your RAM.
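
The si/so check can be automated. The sketch below is one possible approach, not a standard tool: the function name swap_activity is hypothetical, it locates the si and so columns from the vmstat header row (so it also works when an older vmstat prints the extra w column), and the sample output is made up.

```shell
# swap_activity: read vmstat output on stdin, skip the first sample
# (average since boot) and count the samples that show any swapping.
swap_activity() {
    awk '
        $1 == "r" {                       # column-name header row
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        si && $1 ~ /^[0-9]+$/ {
            if (++samples == 1) next      # first line: average since boot
            if ($si + $so > 0) hits++
        }
        END {
            printf "%d of %d samples show swapping\n",
                   hits, (samples > 1 ? samples - 1 : 0)
        }'
}

# Hypothetical vmstat output for the demo.
sample_vmstat='procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 608948 365065 103648 1630364    2    3    10    12  100  200  5  2 92  1  0
 3  0 608948 360000 103648 1630364  120  340    10    12  900 1500 20 10 50 20  0
 2  0 608948 361000 103648 1630364    0    0    10    12  300  400 10  5 84  1  0
 4  1 608948 359000 103648 1630364   80  200    10    12  950 1600 25 10 45 20  0'

printf '%s\n' "$sample_vmstat" | swap_activity
```

On a live system: vmstat -n 5 5 | swap_activity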

Low Memory:
cs is high? The kernel is switching between processes very often (context switches). When si and so are high at the same time, the likely cause is memory pressure and you need more RAM.

Out of Memory:
I ignore free, inact and active because they are less useful for understanding the actual reasons. If you are out of memory, you’ll know it, but unless you look at cs, so, si, etc. you won’t know why, so those columns are redundant.

Use option -a to display active and inactive memory information
Use option -m to see slab (kernel memory cache) details
Use option -s to display event counters and memory statistics as a table

 ******************************************************
free is a command which can give us valuable information on available RAM
free -k -t
-k, --kb Display output in kilobytes (KB). This is the default.
-m, --mb Display output in megabytes (MB).
-g, --gb Display output in gigabytes (GB).
-t, --total Display total summary for physical memory + swap space.
Watch real-time changes every 5 seconds
watch -n 5 free -m
******************************************************
Note: swap memory is the same concept as virtual memory in Windows
The "total" column shows the amount of RAM installed in the system
and the amount of disk space allocated for swap space.
The amount of swap space is normally automatically set to twice the size of the physical memory.
The "used" column indicates how much RAM and swap space are being used.
The "free" column indicates how much RAM and swap space are available.
If for some reason the amount of free RAM becomes low, the appliance will start to preserve free RAM by swapping out the contents of the memory to the hard disk (swap space).
******************************************************
EXPLANATION OF OUTPUT 
******************************************************
Output:
             total       used       free     shared    buffers     cached
Mem:       8027952    4377300    3650652          0     103648    1630364
-/+ buffers/cache:    2643288    5384664
Swap:     15624188     608948   15015240
******************************************************
Explanation:
Line 1: Shows memory details: total available RAM, used RAM, free RAM, shared RAM, RAM used for buffers and RAM used for caching content.
Line 2: Shows used and free memory after buffers and cache are taken into account.
Line 3: Shows total swap available, used swap and free swap.
******************************************************
Line 1:
Mem: 8027952 4377300 3650652 0 103648 1630364
8027952 : Indicates the physical RAM available to your machine. These numbers are in KB.
4377300 : Indicates the RAM used by the system. This includes buffers and cached data as well.
3650652 : Indicates the total RAM that is free and available for new processes to run.
0 : Indicates shared memory. This column is obsolete and may be removed in future releases of free.
103648 : Indicates the total RAM buffered by different applications in Linux.
1630364 : Indicates the total RAM used for caching data for future use.
******************************************************
Line 2:
2643288 : This is the actual amount of used RAM, obtained as RAM used - (buffers + cache)
A bit of arithmetic:
Used RAM = +4377300
Used Buffers = -103648
Used Cache = -1630364
Actual total used RAM is 4377300 - (103648 + 1630364) = 2643288
We can see this in the second column:
-/+ buffers/cache: 2643288 5384664
5384664 : Indicates the actual total RAM available; we get this number by subtracting the actual RAM used from the total RAM in the system.
Total RAM = +8027952
Actual used RAM = -2643288
Total actual available RAM = 5384664
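
The same arithmetic can be done with awk. This is a sketch only: the function name real_mem is hypothetical, and it assumes the older free layout shown here, with separate buffers and cached columns (newer versions of free use different columns).

```shell
# real_mem: read old-style "free" output on stdin and print
# "actual_used actual_free" computed from the Mem: line.
real_mem() {
    awk '$1 == "Mem:" {
            total = $2; used = $3; buffers = $6; cached = $7
            actual_used = used - (buffers + cached)
            print actual_used, total - actual_used
         }'
}

# Demo with the Mem: line from the output above.
sample_free='Mem: 8027952 4377300 3650652 0 103648 1630364'

printf '%s\n' "$sample_free" | real_mem
```

For the sample line this prints 2643288 5384664, matching line 2 of the free output.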
******************************************************
Line 3:
Swap: 15624188 608948 15015240
This line indicates swap details like total SWAP size, used as well as free SWAP.
Swap is a virtual memory created on HDD to increase RAM size virtually.
******************************************************
To see how much RAM is free to use for your applications, run free -m and look at the row that says "-/+ buffers/cache" in the column that says "free". That is your answer in megabytes:
$ free -m
             total       used       free     shared    buffers     cached
Mem:          1504       1491         13          0         91        764
-/+ buffers/cache:        635        869
Swap:         2047          6       2041

Looking only at the "used" column of the Mem: row (1491), you would think the RAM is 99% full, when it is really just 42% (635 of 1504), so that number is misleading.
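
The two percentages can be computed directly from the free -m output. A minimal sketch, assuming the older free layout with a "-/+ buffers/cache" row; the function name mem_pct is hypothetical.

```shell
# mem_pct: from old-style "free -m" output on stdin, print the naive used
# percentage (Mem: row) and the real one (-/+ buffers/cache row).
mem_pct() {
    awk '$1 == "Mem:" { total = $2; naive = $3 }
         $1 == "-/+"  { real = $3 }
         END {
             printf "naive %d%% real %d%%\n",
                    naive * 100 / total, real * 100 / total
         }'
}

# Demo with the free -m output shown above.
sample_free_m='             total       used       free     shared    buffers     cached
Mem:          1504       1491         13          0         91        764
-/+ buffers/cache:        635        869
Swap:         2047          6       2041'

printf '%s\n' "$sample_free_m" | mem_pct
```

On a system whose free still prints this layout: free -m | mem_pct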
******************************************************