mirror of
https://github.com/opsschool/curriculum.git
synced 2025-12-06 00:19:45 +01:00
Add sections on network troubleshooting (#256)
* Add sections on network troubleshooting and working with network engineers
* Apparently emphasize-lines doesn't work, so I just removed it.
* Fix a small typo
* Expound on net-tools deprecation. Turns out, only RHEL7 has officially deprecated usage, with other major distros just silently keeping both available.
* Fix some style and linking issues
* Grammar and style fixes in the netstat section
* netstat is deprecated in favor of ss, which I didn't realize until now. Switched the whole netstat section to ss, and as requested by @miketheman, expounded a bit more on the options.
* Add an example output for iftop and discuss availability of the tool
* Reword the section on network engineers and systems admins working together to have more empathy and less us vs them
* Expound on the merits of ping and telnet as troubleshoot tools.
* Fix minor RST syntax issue
* Add an example of using ping for MTU path discovery
parent
92caf48d15
commit
dcabb147bd
|
|
@ -214,6 +214,360 @@ IPSec
|
|||
SSL
|
||||
---
|
||||
|
||||
|
||||
Network Troubleshooting
=======================

ping
----
ping should be the first step in any network-related troubleshooting session, because of
the simple manner in which it works and the information it returns.
The results of a ping will often point you toward the next thing to check.
For example, if you notice jitter (ping response times varying wildly), you will know to start
looking at layer 1 problems somewhere on the path, and that available bandwidth probably
isn't the issue.
|
||||
You're probably already familiar with the basic usage of ping, but it has some really handy options
(shown here for Linux; most other versions of ping have equivalents, sometimes under different flags).
Here are a couple of my most-often-used options:

-I - Change the source IP address you're pinging from.
For example, there might be multiple IP addresses on your box, and you want to verify that a particular
IP address can ping another.
This comes in useful when there are more than just default routes on a box.

-s - Set the packet size (the number of ICMP data bytes to send).
This is useful when debugging MTU mismatches, by increasing the packet size.
Use in conjunction with the -M flag to set the Path MTU Discovery hint.
|
||||
An example of using ping to test MTU:

.. code-block:: console

  user@opsschool ~$ ping -M do -s 1473 upstream-host
  PING local-host (local-host) 1473(1501) bytes of data.
  From upstream-host icmp_seq=1 Frag needed and DF set (mtu = 1500)
  From upstream-host icmp_seq=1 Frag needed and DF set (mtu = 1500)
  From upstream-host icmp_seq=1 Frag needed and DF set (mtu = 1500)
|
||||
I've used the `-M do` option to set the 'Don't Fragment' (DF) flag on the packet, then
set the packet size to 1473.
With the 28 bytes of header overhead (20 bytes of IPv4 header plus 8 bytes of ICMP header), you can
see the total packet size becomes 1501, just one byte over the MTU of the remote end.
As you can see from the example, since the DF flag is set and the packet would need to be fragmented,
an error comes back that helpfully tells us what the MTU is on the other end.
ping can be used to determine Path MTU (the smallest MTU size along a path), but other
tools are better for that (see below).
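
The arithmetic behind the packet size can be sketched in a few lines of shell.
This is only an illustration, assuming IPv4 with no IP options (a 20-byte IP header) and the
8-byte ICMP header; the function names are made up for this example and are not part of ping.

```shell
#!/bin/sh
# Assumption: IPv4 header without options (20 bytes) + ICMP header (8 bytes).
IPV4_ICMP_OVERHEAD=28

# ping_payload_for_mtu MTU
# Prints the largest value for `ping -s` that still fits in one packet.
ping_payload_for_mtu() {
    echo $(( $1 - IPV4_ICMP_OVERHEAD ))
}

# ping_packet_size PAYLOAD
# Prints the total IP packet size that results from a given `ping -s` value.
ping_packet_size() {
    echo $(( $1 + IPV4_ICMP_OVERHEAD ))
}

ping_payload_for_mtu 1500   # prints 1472
ping_packet_size 1473       # prints 1501
```

On a path with a 1500-byte MTU, `ping -M do -s 1472 upstream-host` should get through, while a
size of 1473 should be rejected as in the example above.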
|
||||
telnet
------

While telnet daemons are a big no-no in the wild (unencrypted traffic), the telnet client
utility can be used to test whether TCP connections can be made on the specified port.
For example, you can verify a TCP connection to port 80 by connecting via telnet:

.. code-block:: console

  user@opsschool ~$ telnet yahoo.com 80
  Trying 98.138.253.109...
  Connected to yahoo.com.
  Escape character is '^]'.

A connection failure would look like this (using port 8000, since nothing is listening on that port):

.. code-block:: console

  user@opsschool ~$ telnet yahoo.com 8000
  Trying 98.138.253.109...
  telnet: connect to address 98.138.253.109: Connection timed out
|
||||
You can also send raw data via telnet, allowing you to verify operation of the
daemon at the other end.
For example, we can send HTTP headers by hand:

.. code-block:: console

  user@opsschool ~$ telnet opsschool.org 80
  Trying 208.88.16.54...
  Connected to opsschool.org.
  Escape character is '^]'.
  GET / HTTP/1.1
  host: www.opsschool.org

The last two lines are the commands sent to the remote host.

Here's the response from the web server:

.. code-block:: console

  HTTP/1.1 302 Found
  Date: Fri, 26 Dec 2014 14:55:45 GMT
  Server: Apache
  Location: http://www.opsschool.org/
  Vary: Accept-Encoding
  Content-Length: 276
  Content-Type: text/html; charset=iso-8859-1

  <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
  <html><head>
  <title>302 Found</title>
  </head><body>
  <h1>Found</h1>
  <p>The document has moved <a href="http://www.opsschool.org/">here</a>.</p>
  <hr>
  <address>Apache Server at www.opsschool.org Port 80</address>
  </body></html>
  Connection closed by foreign host.
|
||||
Here we passed the bare minimum required to initiate an HTTP session to a remote web server,
and it responded with HTTP data, in this case telling us that the page we requested is located
elsewhere.
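
When building such a request outside of an interactive telnet session, the parts that are easy
to miss are the line endings: HTTP/1.1 expects each line to end in CRLF, and an empty line ends
the headers.
Here is a minimal sketch that produces the same request as the session above; `http_get_request`
is a hypothetical helper name chosen for this example, not part of any tool.

```shell
#!/bin/sh
# http_get_request HOST [PATH]
# Prints a bare-minimum HTTP/1.1 GET request: request line, Host header,
# and the empty line that terminates the headers. Every line ends in CRLF.
http_get_request() {
    printf 'GET %s HTTP/1.1\r\nHost: %s\r\n\r\n' "${2:-/}" "$1"
}
```

The output could be piped into any raw TCP client, for example
`http_get_request www.opsschool.org | nc opsschool.org 80`, assuming a netcat client is installed.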
|
||||
Note that the port you're connecting to might be the usual port for HTTP, but something
other than an HTTP daemon could be running on it.
Nothing prevents a service from running on a port other than its usual one.
In effect, you could run an SMTP daemon on port 80, and an HTTP daemon on port 25.
Testing TCP connections with telnet would verify TCP operation, but you still would not
have a working web server on port 80.
Since this section focuses only on the networking aspect, see the other
sections of OpsSchool for troubleshooting daemon operation and Linux troubleshooting.
|
||||
iproute / ifconfig
------------------

ifconfig is ubiquitous and a mainstay of any network-related work on Linux, but it's actually
`deprecated in RHEL7 <https://bugzilla.redhat.com/show_bug.cgi?id=1119297>`_ (the net-tools
package which contains `ifconfig` isn't included in RHEL 7/CentOS 7 by default) and many major
distributions include iproute by default.
The `ifconfig <http://linux.die.net/man/8/ifconfig>`_ man page also recommends using the iproute package.
All examples used below will use iproute, and will cover only the basics of troubleshooting.
It's highly recommended to play around and see what you can find.
The `ip <http://linux.die.net/man/8/ip>`_ man page contains a wealth of knowledge on the tool.
|
||||
ip addr show
^^^^^^^^^^^^^^

Show all IP addresses on all interfaces.
Many options can be passed to filter out information.
This will show several important pieces of information, such as MAC address, IP address, MTU, and link state.

.. code-block:: console

  user@opsschool ~$ ip addr show
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      inet 127.0.0.1/8 scope host lo
      inet6 ::1/128 scope host
         valid_lft forever preferred_lft forever
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
      link/ether 08:00:27:8a:6d:07 brd ff:ff:ff:ff:ff:ff
      inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
      inet6 fe80::a00:27ff:fe8a:6d07/64 scope link
         valid_lft forever preferred_lft forever
|
||||
ip route
^^^^^^^^

Show all routes on the box.

.. code-block:: console

  user@opsschool ~$ ip route
  10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
  169.254.0.0/16 dev eth0 scope link metric 1002
  default via 10.0.2.2 dev eth0
|
||||
|
||||
ss
--

`ss` is the replacement for `netstat`, which is obsolete according to the `netstat man
page <http://linux.die.net/man/8/netstat>`_.
While most distributions will probably have netstat available for some time, it is
worthwhile to get used to using ss instead, which is already included in the iproute
package.

ss is very useful for checking connections on a box.
ss will show Unix domain socket, TCP, and UDP connections, in various connection states.
For example, here's ss showing all TCP and UDP sockets in the LISTEN state, with
numeric representation.
In other words, this shows all daemons listening on UDP or TCP with DNS and port lookup
disabled.
|
||||
.. code-block:: console

  user@opsschool ~$ ss -tuln
  Netid  State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
  tcp    LISTEN  0       128     *:80                *:*
  tcp    LISTEN  0       50      *:4242              *:*
  tcp    LISTEN  0       50      :::4242             :::*
  tcp    LISTEN  0       50      *:2003              *:*
  tcp    LISTEN  0       50      *:2004              *:*
  tcp    LISTEN  0       128     :::22               :::*
  tcp    LISTEN  0       128     *:22                *:*
  tcp    LISTEN  0       100     *:3000              *:*
  tcp    LISTEN  0       100     ::1:25              :::*
  tcp    LISTEN  0       100     127.0.0.1:25        *:*
  tcp    LISTEN  0       50      *:7002              *:*
|
||||
There are a few things to note in the output.
A local address of * means the daemon is listening on all IP addresses the server might have.
A local address of 127.0.0.1 means the daemon is listening only on the loopback interface, and therefore
won't accept connections from outside of the server itself.
A local address of ::: is the same thing as \*, but for IPv6.
Likewise, ::1 is the same as 127.0.0.1, but for IPv6.
In this example, we used the flags `-tuln` (TCP, UDP, listening sockets only, numeric output),
which just happens to be one of my more-often-used sets of flags.
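
The reading of the Local Address column described above can also be automated.
The following is a rough sketch, not a robust parser: `classify_listeners` is a hypothetical
helper that reads `ss -tuln` output on standard input and assumes the six-column layout shown
above, with the local address in the fifth column.

```shell
#!/bin/sh
# classify_listeners
# Reads `ss -tuln` output on stdin and prints "<netid> <port> <scope>" per socket.
# Assumption: the first line is the header and column 5 is Local Address:Port.
classify_listeners() {
    awk 'NR > 1 {
        local_addr = $5
        port = local_addr
        sub(/.*:/, "", port)        # keep what follows the last colon
        host = local_addr
        sub(/:[^:]*$/, "", host)    # drop the trailing :port
        if (host == "*" || host == "::" || host == "0.0.0.0" || host == "[::]")
            scope = "all-addresses"
        else if (host == "127.0.0.1" || host == "::1" || host == "[::1]")
            scope = "loopback-only"
        else
            scope = "specific-address"
        print $1, port, scope
    }'
}
```

It would be used as `ss -tuln | classify_listeners`; anything reported as `loopback-only` is not
reachable from outside the server.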
|
||||
By default, ss shows only non-listening TCP connections:

.. code-block:: console

  user@opsschool ~$ ss

  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
  ESTAB  0       0       10.0.2.15:ssh       10.0.2.2:64667

ss has many more useful flags than just these, which you can find in the `ss man page <http://linux.die.net/man/8/ss>`_.
|
||||
traceroute
----------

If you've familiarized yourself with the basics of networking, you'll know that networks are made up of
many different routers.
The internet is mostly a messy jumble of routers, with multiple paths to end points.
traceroute is useful for finding connection problems along the path.

traceroute works by a very clever mechanism, using UDP packets on Linux or ICMP packets on Windows.
traceroute can also use TCP, if so configured.
traceroute sends packets with an increasing TTL value, starting the TTL value at 1.
The first router (hop) receives the packet, then decrements the TTL value, resulting in the packet
getting dropped since the TTL has reached zero.
The router then sends an ICMP Time Exceeded message back to the source.
This response tells the source the identity of the hop.
The source sends another packet, this time with a TTL value of 2.
The first router decrements it as usual, then sends it to the second router, which decrements it to
zero and sends a Time Exceeded back.
This continues until the final destination is reached.
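
The TTL mechanism described above can be illustrated with a toy simulation.
No packets are sent here; `simulate_traceroute` is a made-up function that takes the path as
its arguments (the last argument being the destination) and prints which node would answer
each probe.

```shell
#!/bin/sh
# simulate_traceroute NODE... DESTINATION
# For each probe TTL (1, 2, 3, ...) walk the path, decrementing the TTL at
# every node, and report the node at which the TTL reaches zero.
simulate_traceroute() (
    hops=$#
    ttl=1
    while [ "$ttl" -le "$hops" ]; do
        remaining=$ttl
        position=0
        for node in "$@"; do
            position=$((position + 1))
            remaining=$((remaining - 1))    # each node decrements the TTL
            if [ "$remaining" -eq 0 ]; then
                break
            fi
        done
        if [ "$position" -eq "$hops" ]; then
            # A real destination answers differently, for example with
            # ICMP Port Unreachable when UDP probes are used.
            echo "$ttl $node destination-reached"
        else
            echo "$ttl $node time-exceeded"
        fi
        ttl=$((ttl + 1))
    done
)
```

Calling `simulate_traceroute router1 router2 server` prints one line per probe, mirroring the
one-line-per-hop layout of real traceroute output.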
|
||||
An example:

.. code-block:: console

  user@opsschool ~$ traceroute google.com
  traceroute to google.com (173.194.123.39), 30 hops max, 60 byte packets
   1  (redacted) (redacted)  1.153 ms  1.114 ms  1.096 ms
   2  192.241.164.253 (192.241.164.253)  0.226 ms 192.241.164.241 (192.241.164.241)  3.267 ms 192.241.164.253 (192.241.164.253)  0.222 ms
   3  core1-0-2-0.lga.net.google.com (198.32.160.130)  0.291 ms  0.322 ms 192.241.164.250 (192.241.164.250)  0.201 ms
   4  core1-0-2-0.lga.net.google.com (198.32.160.130)  0.290 ms 216.239.50.108 (216.239.50.108)  0.980 ms  1.172 ms
   5  216.239.50.108 (216.239.50.108)  1.166 ms 209.85.240.113 (209.85.240.113)  1.143 ms  1.358 ms
   6  209.85.240.113 (209.85.240.113)  1.631 ms lga15s47-in-f7.1e100.net (173.194.123.39)  0.593 ms  0.554 ms
|
||||
|
||||
mtr
---

mtr is a program that combines the functionality of `ping` and `traceroute` into one utility.

.. code-block:: console

  user@opsschool ~$ mtr -r google.com
  HOST: opsschool                    Loss%   Snt   Last   Avg  Best  Wrst StDev
    1. (redacted)                     0.0%    10    0.3   0.4   0.3   0.5   0.1
    2. 192.241.164.237                0.0%    10    0.3   0.4   0.3   0.4   0.0
    3. core1-0-2-0.lga.net.google.c   0.0%    10    0.4   0.7   0.4   2.8   0.8
    4. 209.85.248.178                 0.0%    10    0.5   1.1   0.4   6.2   1.8
    5. 72.14.239.245                  0.0%    10    0.7   0.8   0.7   1.4   0.2
    6. lga15s46-in-f5.1e100.net       0.0%    10    0.5   0.4   0.4   0.5   0.0

mtr can be run continuously or in report mode (-r).
The columns are self-explanatory, as they are the same columns seen when running traceroute
or ping independently.

Reading mtr reports can be a skill in itself, since there's so much information packed into them.
There are many excellent in-depth guides to mtr that can be found online.
|
||||
iftop
-----

iftop displays bandwidth usage on a specific interface, broken down by remote host.
You can use filters to filter out data you don't care about, such as DNS traffic.
iftop is not available in the base repositories for RHEL/CentOS or Ubuntu, but is available in
`EPEL <https://fedoraproject.org/wiki/EPEL>`_ and the Universe repository, respectively.

In this example, iftop is listening only to the eth0 interface, and for purposes of this document,
is also using the -t option, which disables the ncurses interface (for your use, you won't need -t).
This box is a very low-traffic VM, so there's not much here, but it does give a sense of what
information is available via the tool.
|
||||
.. code-block:: console

  user@opsschool ~$ sudo iftop -i eth0 -t
  interface: eth0
  IP address is: 10.0.2.15
  MAC address is: 08:00:27:ffffff8a:6d:07
  Listening on eth0
     # Host name (port/service if enabled)            last 2s   last 10s   last 40s cumulative
  --------------------------------------------------------------------------------------------
     1 10.0.2.15                                =>       804b       804b       804b       201B
       google-public-dns-a.google.com           <=       980b       980b       980b       245B
     2 10.0.2.15                                =>       352b       352b       352b        88B
       10.0.2.2                                 <=       320b       320b       320b        80B
  --------------------------------------------------------------------------------------------
  Total send rate:                                     1.13Kb     1.13Kb     1.13Kb
  Total receive rate:                                  1.27Kb     1.27Kb     1.27Kb
  Total send and receive rate:                         2.40Kb     2.40Kb     2.40Kb
  --------------------------------------------------------------------------------------------
  Peak rate (sent/received/total):                     1.12Kb     1.27Kb     2.40Kb
  Cumulative (sent/received/total):                      289B       325B       614B
  ============================================================================================
|
||||
iperf
-----

iperf is a bandwidth testing utility.
It consists of a daemon and client, running on separate machines.

This output is from the client's side:

.. code-block:: console

  user@opsschool ~$ sudo iperf3 -c remote-host
  Connecting to host remote-host, port 5201
  [  4] local 10.0.2.15 port 45687 connected to x.x.x.x port 5201
  [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
  [  4]   0.00-1.00   sec   548 KBytes  4.48 Mbits/sec    0   21.4 KBytes
  [  4]   1.00-2.00   sec   503 KBytes  4.12 Mbits/sec    0   24.2 KBytes
  [  4]   2.00-3.00   sec   157 KBytes  1.28 Mbits/sec    0   14.3 KBytes
  [  4]   3.00-4.00   sec  0.00 Bytes   0.00 bits/sec     0   14.3 KBytes
  [  4]   4.00-5.00   sec   472 KBytes  3.88 Mbits/sec    0   20.0 KBytes
  [  4]   5.00-6.00   sec   701 KBytes  5.74 Mbits/sec    0   45.6 KBytes
  [  4]   6.00-7.00   sec   177 KBytes  1.45 Mbits/sec    0   14.3 KBytes
  (snip)
|
||||
Some of the really handy options (flags differ between iperf 2 and iperf3, so check the man page
for the version you have installed):

-m - Print the maximum segment size (the largest amount of data, in bytes, that a system can support
in an unfragmented TCP segment). This flag comes from iperf 2.
By default, the MSS follows from the MTU of the network media in use (eg, Ethernet has an MTU of
1500 bytes, which leaves an MSS of 1460 bytes after the IP and TCP headers).

-M - Set the MSS to a different value than the default.
Useful for testing performance at various MTU settings.

-u - Use UDP instead of TCP.
Since UDP is connectionless and does not retransmit, this will give great information about jitter and packet loss.
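
To pick a value for -M, the relationship between MTU and MSS can be written out as a small
calculation. This is a sketch that assumes IPv4 and TCP headers without options (20 bytes each);
`mss_for_mtu` is a hypothetical helper name, not an iperf feature.

```shell
#!/bin/sh
# Assumption: 20-byte IPv4 header + 20-byte TCP header, no options on either.
TCP_IPV4_OVERHEAD=40

# mss_for_mtu MTU
# Prints the TCP maximum segment size that fits in one packet of the given MTU.
mss_for_mtu() {
    echo $(( $1 - TCP_IPV4_OVERHEAD ))
}

mss_for_mtu 1500   # prints 1460
mss_for_mtu 9000   # prints 8960
```

With iperf3 the result could be passed along as `iperf3 -c remote-host -M 1460`.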
|
||||
tcpdump
-------

For occasions where you need to get into the nitty-gritty and look at actual network behavior,
tcpdump is the go-to tool.
tcpdump will show raw connection details and packet contents.

Since one could devote entire pages to the usage of tcpdump, it is recommended to search online
for any one of the many great guides on tcpdump usage.


Troubleshooting layer 1 problems
================================
|
||||
|
|
@ -447,3 +801,48 @@ Issues with fiber cable generally fall into two categories:
|
|||
The symptoms for both are very similar: intermittent RX/TX and CRC errors
indicate a dirty or scratched fiber, while persistent RX/TX and CRC errors
indicate a bad optic/transceiver.
|
||||
|
||||
Differences in perspective: network engineering and systems administration
==========================================================================

Network engineering and systems administration have a tendency to speak different languages,
due to the divide in skillsets and lack of overlap.

A good way to view how network engineering sees the technology differently is to consider
the OSI model: network engineering is focused primarily on layers 1, 2, and 3, with the
occasional venture into the higher layers when certain routing protocols are involved (eg,
BGP peering sessions operate over TCP).
System administrators, on the other hand, are typically more concerned with layers 4 through 7,
with the occasional venture into layer 3 for IP addressing.
If one considers the perspective of the other, empathy and understanding come easier, and
anticipating what the other side expects becomes straightforward.
|
||||
As such, here are a few tips on how the two specializations see the same technology:

1. Network engineers work in bits-per-second (bps), while many server-specific utilities
   output in bytes-per-second (Bps).
   As such, be sure when you're sending throughput data to network engineering that it's in
   bits-per-second.
   Your monitoring tools will usually do this for you (look for a config option).
   On the occasion you need to do it by hand, and you're working in bytes, simply multiply by eight
   to get bits.
   For more information on conversions, `Wikipedia <http://en.wikipedia.org/wiki/Data_rate_units>`_
   has a good article on unit measurements.
   Alternatively, use an online calculator.
2. Systems administrators often don't worry about network topology since it so often "just works".
   However, in some cases, especially situations where you're troubleshooting hosts across the
   open internet, you may run into something called an 'asymmetrical path', that is, the routing
   is using a different path out than it does coming back in.
   In such situations, one path might have issues, while the other path is perfectly fine.
   For this reason, when sending issue reports to a network engineer, be sure to send
   a `traceroute`/`mtr` report from *both* directions.
   This situation is not common on internal networks, but can be on the open Internet.
   You may also run into this situation when you have more complex routing set up on the local server.
3. A simple `ping` sometimes isn't enough for an issue report, for the simple reason that it contains
   so little information.
   A `traceroute`/`mtr` report is better.
   If there are suspected throughput issues, try to get an `iperf` report as well.
   Also include an interface configuration report, showing MTU, IP address, and MAC address
   of the relevant interface(s).
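
The bytes-to-bits conversion from the first tip is easy to script.
A minimal sketch follows; the function names are invented for this example, and the megabit
figure assumes decimal units (1 Mbps = 1,000,000 bits per second) with integer arithmetic.

```shell
#!/bin/sh
# bytes_to_bits BYTES
# Prints the same quantity expressed in bits (1 byte = 8 bits).
bytes_to_bits() {
    echo $(( $1 * 8 ))
}

# bytes_per_sec_to_mbps BYTES_PER_SECOND
# Prints the rate in whole megabits per second (decimal, rounded down).
bytes_per_sec_to_mbps() {
    echo $(( $1 * 8 / 1000000 ))
}

bytes_to_bits 12500000           # prints 100000000
bytes_per_sec_to_mbps 12500000   # prints 100
```

So a tool reporting 12500000 bytes-per-second is describing the same traffic a network engineer
would call 100 Mbps.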
|
||||
|
|
|
|||