The first steps should include basic system resource checks:

  • After logging in, run the w command in a terminal for a quick first check. It shows a few basic pieces of information: uptime, the number of logged-in users and, most importantly, the average system load over the last 1, 5 and 15 minutes. A load that stays well above the number of CPU cores, e.g. 100 or more, means that one or more processes demand far more CPU time than the system can provide, which usually indicates that such a process is misbehaving.
    Sample output of the w command:

    10:51:12 up 43 days, 23:57,  8 users,  load average: 0,72, 1,12, 1,36
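The load check described above can be sketched in Python. This is a minimal illustration, not part of the original procedure: the helper names load_per_core and is_overloaded are hypothetical, and the threshold of 1.0 per core is a rule-of-thumb assumption rather than a fixed limit.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average normalised by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def is_overloaded(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """True when the per-core load exceeds the threshold (assumed: 1.0)."""
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the same 1, 5 and 15 minute averages that w prints.
    averages = os.getloadavg()
    cores = os.cpu_count() or 1
    for label, value in zip(("1 min", "5 min", "15 min"), averages):
        state = "overloaded" if is_overloaded(value, cores) else "ok"
        print(f"{label}: {value:.2f} ({load_per_core(value, cores):.2f} per core, {state})")
```

Dividing by the core count makes the number comparable between machines: a load of 8 is normal on a 16-core server but heavy on a 2-core one.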
  • Various system resources can be checked with the following commands:

    free -mh: On Unix-like operating systems, the free command displays the total amount of free and used physical and swap memory, as well as the buffers and caches used by the kernel. When the system runs out of memory, the kernel invokes the OOM Killer ("Out Of Memory Killer"), a kernel feature that kills processes based on their oom_score. In other words, the OOM Killer terminates the process that allocates the most memory and is the least important to the system, so the first processes to be killed are usually user applications with the largest memory allocation.
    Sample output of the free -mh command:

              total   used    free    shared  buff/cache  available
    Mem:      62G     44G     610M    2,6G    17G         14G
    Swap:     31G     9,0G    22G
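The figures printed by free come from /proc/meminfo, so the same check can be scripted. The sketch below is an illustration under assumptions: the function names are hypothetical, and it relies on the MemAvailable field, which older kernels do not provide.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into a {field: kibibytes} dictionary."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_percent(meminfo: dict) -> float:
    """Share of total RAM that the kernel estimates is available, in percent."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]


def swap_used_kib(meminfo: dict) -> int:
    """Amount of swap currently in use, in kibibytes."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the live memory statistics (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

For example, available_percent(read_meminfo()) returns a single number that can be compared against an alert threshold of your choice.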


    A more detailed and more precise view is available with a different command:

    dstat -vn: It shows not only memory-related data, but also CPU, disk and network bandwidth usage, and makes it possible to track what is happening with these resources in close to real time. Sample output of dstat -vn:



    df -i: Before checking the disk space used, it is good practice to check the number of free inodes on each filesystem, because all of them may be used up even before the disk runs out of free space. Sample output of the df -i command:

    Filesystem  Inodes      IUsed   IFree       IUse%   Mounted on
    devtmpfs    483115      321     482794      1%      /dev
    tmpfs       484994      2       484992      1%      /dev/shm
    tmpfs       484994      416     484578      1%      /run
    tmpfs       484994      16      484978      1%      /sys/fs/cgroup
    /dev/sda1   20971008    93976   20877032    1%      /


    df -h:
    It shows the total, used and free disk space on all mounted partitions/drives, which can be helpful when determining what is causing the slowdown. Maintaining enough free disk space is crucial to keep the system running smoothly, e.g. a full root partition "/" can destabilize the whole operating system.
    Sample output of the df -h command:

    Filesystem  Size    Used    Avail   Use%    Mounted on
    /dev/md2    906G    828G    33G     97%     /
    devtmpfs    32G     0       32G     0%      /dev
    tmpfs       32G     24K     32G     1%      /dev/shm
    tmpfs       32G     2,5G    29G     8%      /run
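Both the inode check (df -i) and the disk space check (df -h) can be reproduced with os.statvfs. The following is a minimal sketch; the function names are hypothetical, and the percentages may differ slightly from those printed by df, which takes blocks reserved for root into account.

```python
import os


def used_percent(total: int, free: int) -> float:
    """Percentage of a resource (blocks or inodes) that is in use."""
    if total == 0:
        return 0.0  # some pseudo filesystems report no inodes at all
    return 100.0 * (total - free) / total


def filesystem_usage(path: str) -> dict:
    """Return block and inode usage for the filesystem that holds path."""
    stats = os.statvfs(path)
    return {
        "space_used_percent": used_percent(stats.f_blocks, stats.f_bfree),
        "inodes_used_percent": used_percent(stats.f_files, stats.f_ffree),
    }


if __name__ == "__main__":
    for key, value in filesystem_usage("/").items():
        print(f"{key}: {value:.1f}")
```

Checking both values together catches the case described above, where a filesystem still has free space but no free inodes left.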


    iotop -aoP:
    A tool that shows the current disk read, disk write, swap-in and IO% values per process, together with the associated command and the user it runs as. It lists the processes with the most disk activity. When a process performs, for example, too many writes to disk, it may slow down the whole system, because other disk operations are queued behind this demanding process.
    Sample output of the iotop -aoP command sorted by IO% (input/output usage in percent; the lower, the better, and 100% is the maximum for the system):



    top/htop: real-time monitoring tools that cover various aspects of the system. They show many useful per-process parameters: PID, nice level, the exact command with the child processes it created, the owner of the command (the user it was started under), the running time of the command, the percentage of CPU and memory used, the load on the system and a few more. They are a convenient way of finding out which process demands the most CPU or memory, when it was started and by whom. With the -u USER switch, only processes owned by a specific user are listed.
    Sample output of the top command:


    cat /proc/mdstat: /proc/mdstat is a file maintained by the kernel that contains real-time information about the RAID arrays and their devices. For a detailed view of an individual array use: mdadm --detail /dev/md0 (replace md0 with your device name taken from cat /proc/mdstat).
    Sample output of cat /proc/mdstat:

    Personalities : [raid1]
    md0 : active raid1 sda1[0] sdb1[1]
    511988 blocks super 1.0 [2/2] [UU]
     
    md3 : active raid1 sdb5[1] sda5[0]
    2024086627 blocks super 1.2 [2/2] [UU]
     
    md2 : active raid1 sda3[0] sdb3[1]
    20478908 blocks super 1.1 [2/2] [UU]
     
    md1 : active raid1 sda2[0] sdb2[1]
    102398908 blocks super 1.1 [2/2] [UU]
    bitmap: 1/1 pages [4KB], 65536KB chunk
     
    unused devices: <none>


    A sample of detailed view using mdadm --detail /dev/md0:

    /dev/md0:
    Version : 1.0
    Creation Time : Thu Dec 13 14:55:34 2012
    Raid Level : raid1
    Array Size : 511988 (499.99 MiB 524.28 MB)
    Used Dev Size : 511988 (499.99 MiB 524.28 MB)
    Raid Devices : 2
    Total Devices : 2
    Persistence : Superblock is persistent
     
    Update Time : Sun May 31 01:00:08 2020
    State : clean
    Active Devices : 2
    Working Devices : 2
    Failed Devices : 0
    Spare Devices : 0
     
    Name : example.com:0 (local to host example.com)
    UUID : 9a6bb5a1:421cdbf3:876162a1:93ca42ac
    Events : 1189
     
    Number Major Minor RaidDevice State
    0 8 1 0 active sync /dev/sda1
    1 8 17 1 active sync /dev/sdb1
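In the /proc/mdstat output, a healthy two-disk mirror is shown as [UU], while a missing or failed member is shown as an underscore, e.g. [U_]. The sketch below looks for such arrays. It is an illustration only: the function names are hypothetical, and the pattern assumes array names of the form md0, md1 and so on.

```python
import re

ARRAY_LINE = re.compile(r"^(md\d+)\s*:")                 # e.g. "md0 : active raid1 ..."
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s*\[([U_]+)\]")    # e.g. "[2/2] [UU]"


def parse_mdstat(text: str) -> dict:
    """Map each md array name to its member flags, e.g. {'md0': 'UU'}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_LINE.match(line.strip())
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current is not None:
            arrays[current] = status.group(3)
            current = None
    return arrays


def degraded_arrays(text: str) -> list:
    """Names of arrays where at least one member is flagged '_'."""
    return [name for name, flags in parse_mdstat(text).items() if "_" in flags]
```

Passing the content of /proc/mdstat to degraded_arrays returns an empty list when every array is complete, as in the sample output above.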


    smartctl: smartctl is a command-line utility that performs SMART tasks such as printing the SMART self-test and error logs, enabling and disabling SMART automatic testing, and initiating device self-tests. It can be used to determine the health of disks and the overall status of physical devices.
    To see the overall SMART health assessment, run: smartctl -H /dev/sda (where /dev/sda is the path to the device).
    Sample output of this command:

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    smartctl -a /dev/sda: It outputs a lot of detailed information; one of the most important parts is the table of SMART attributes, which shows the exact values recorded for each individual attribute.
    Sample data from the smartctl -a /dev/sda command:

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 79071778
    3 Spin_Up_Time 0x0003 092 092 000 Pre-fail Always - 0
    4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 8
    5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
    7 Seek_Error_Rate 0x000f 087 042 030 Pre-fail Always - 4906913691
    9 Power_On_Hours 0x0032 026 026 000 Old_age Always - 65697
    10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
    12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 8
    184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
    187 Reported_Uncorrect 0x0032 098 098 000 Old_age Always - 2
    188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
    189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
    190 Airflow_Temperature_Cel 0x0022 066 060 045 Old_age Always - 34 (Min/Max 24/38)
    191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
    192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 5
    193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 8
    194 Temperature_Celsius 0x0022 034 040 000 Old_age Always - 34 (0 22 0 0 0)
    195 Hardware_ECC_Recovered 0x001a 018 003 000 Old_age Always - 79071778
    197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
    198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
    199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
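The attribute table can be screened automatically. The sketch below parses rows in the format shown above and flags attributes that look suspicious. The selection rules are assumptions made for illustration (normalised value at or below the threshold, a non-empty WHEN_FAILED column, or a non-zero raw count for a few sector-related attributes); they are not an official smartmontools verdict, and the function names are hypothetical.

```python
# Attributes whose raw value is commonly expected to stay at zero (assumed list).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text: str) -> list:
    """Parse the attribute rows printed by smartctl -a into dictionaries."""
    rows = []
    for line in text.splitlines():
        # Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
        fields = line.split(None, 9)
        if len(fields) < 10:
            continue
        if not all(fields[i].isdigit() for i in (0, 3, 4, 5)):
            continue  # header line or unrelated output
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": fields[9].strip(),
        })
    return rows


def suspicious_attributes(rows: list) -> list:
    """Names of attributes that deserve a closer look (heuristic)."""
    flagged = []
    for row in rows:
        below_threshold = row["thresh"] > 0 and row["value"] <= row["thresh"]
        has_failed = row["when_failed"] != "-"
        raw_count = row["raw"].split()[0]
        nonzero_watched = (
            row["name"] in WATCHED and raw_count.isdigit() and int(raw_count) > 0
        )
        if below_threshold or has_failed or nonzero_watched:
            flagged.append(row["name"])
    return flagged
```

The raw value is kept as text because some attributes, such as the temperature rows, append extra details after the number.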


    drbdadm status: this command shows the status of DRBD (if DRBD is configured and in use).
    Sample output showing proper functioning:

    xtm role:Primary
    disk:UpToDate
    xtm.xtm-cloud.com role:Secondary
    peer-disk:UpToDate


    dmesg -xe:
    A command that prints the kernel message buffer. It helps with tracking down malfunctioning device drivers or the devices themselves, among other things. Typical output contains a lot of information, so it is best piped to the less or more command (which allow scrolling text in a terminal window).
    Sample output of the dmesg | more command:

    [63708861.865526] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
    [63708866.774452] md: md1: data-check done.
    [63708866.787701] md: data-check of RAID array md2
    [63708866.787703] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
    [63708866.787706] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
    [63708866.787710] md: using 128k window, over a total of 20478908k.
    [63709065.096909] md: md2: data-check done.
    [63841650.792185] TCP: Peer 173.209.86.18:25/54454 unexpectedly shrunk window 354567372:354567422 (repaired)
    [63841651.535664] TCP: Peer 173.209.86.18:25/54454 unexpectedly shrunk window 354567372:354567422 (repaired)
    [63841653.022798] TCP: Peer 173.209.86.18:25/54454 unexpectedly shrunk window 354567372:354567422 (repaired)
    [63841655.997055] TCP: Peer 173.209.86.18:25/54454 unexpectedly shrunk window 354567372:354567422 (repaired)
    --More--



Next, we need to investigate further:

  • Redis:

    If an application, for example Workbench, is running slowly, make sure that Redis is working properly and that no performance-related errors appear in its logs. The Redis log is usually located at: /var/log/redis/redis.log
    Sample output of Redis logs showing proper functioning (cat /var/log/redis/redis.log):

    31116:C 25 May 14:19:09.999 * RDB: 0 MB of memory used by copy-on-write
    18858:M 25 May 14:19:10.093 * Background saving terminated with success
    18858:M 25 May 14:24:11.100 * 10 changes in 300 seconds. Saving...
    18858:M 25 May 14:24:11.101 * Background saving started by pid 32718
    32718:C 25 May 14:24:13.403 * DB saved on disk
    32718:C 25 May 14:24:13.404 * RDB: 0 MB of memory used by copy-on-write
    18858:M 25 May 14:24:13.505 * Background saving terminated with success


  • Garbage Collector (GC)

    The garbage collector logs should be checked as well. If, for example, the GC runs for 50 s out of every minute, or GC entries are logged many times per second, this may suggest that too little memory is allocated to the application. Note that if there is a major problem with the garbage collector, the logs will show 'Full GC'.
    The logs can be checked using: grep GC catalina.out
    Sample output of normal GC logs:

    2020-06-03T11:10:13.259+0200: [GC (Allocation Failure) 2020-06-03T11:10:13.259+0200: [ParNew: 279595K->7705K(307200K), 0.0035476 secs] 376273K->104389K(989888K) icms_dc=0 0.0035875 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
    2020-06-03T11:10:16.681+0200: [GC (Allocation Failure) 2020-06-03T11:10:16.682+0200: [ParNew: 2297912K->49674K(2530368K), 0.0368861 secs] 10907536K->8662380K(31176192K) icms_dc=0 0.0383446 secs] [Times: user=0.74 sys=0.05, real=0.04 secs]
    2020-06-03T11:10:13.880+0200: [GC (Allocation Failure) [PSYoungGen: 33192766K->427454K(33791488K)] 78245240K->45487799K(103696896K), 0.1031852 secs] [Times: user=1.61 sys=0.59, real=0.10 secs]
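To judge how much time the application spends in garbage collection, the pause times can be added up. The sketch below is an illustration that assumes the log format shown in the sample above, where each entry ends with a "real=... secs" value; the function name is hypothetical.

```python
import re

REAL_TIME = re.compile(r"real=(\d+[.,]\d+) secs")


def summarize_gc(lines) -> dict:
    """Count GC entries and add up their wall-clock ('real') pause times."""
    summary = {"events": 0, "full_gc": 0, "pause_seconds": 0.0}
    for line in lines:
        match = REAL_TIME.search(line)
        if not match:
            continue  # not a GC entry with timing information
        summary["events"] += 1
        summary["pause_seconds"] += float(match.group(1).replace(",", "."))
        if "Full GC" in line:
            summary["full_gc"] += 1
    return summary
```

If pause_seconds for one minute of log entries approaches 60, or full_gc is greater than zero, the memory settings of the application are worth reviewing.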


  • PostgreSQL

    Log into your PostgreSQL instance using psql, then switch to the postgres database with '\c postgres'. Note that your setup may differ, so if needed, use your own credentials and the appropriate switches (psql -U username -W -d postgres). Finally, run this query:

    select pid,
           usename,
           pg_blocking_pids(pid) as blocked_by,
           query as blocked_query
    from pg_stat_activity
    where cardinality(pg_blocking_pids(pid)) > 0;

    If there are no errors or stuck queries, you should see output similar to this:

    pid | usename | blocked_by | blocked_query
    -----+---------+------------+---------------
    (0 rows)