Technical problems and their solutions' Journal
Below are the 12 most recent journal entries recorded in Technical problems and their solutions' LiveJournal:

Thursday, October 3rd, 2013
9:08 am
[dimrub]
A base64-encoded deja vu
I was writing an HTTP client of sorts that would upload files, using a base64-encoded multipart HTTP request, to an HTTP server I also had control over. Everything worked fine until I added a test that transferred files of small random sizes. Every once in a while an upload would fail, and the server would report "malformed mime body" (or something along those lines). Digging into the server code, I discovered that the server considered the base64-encoded message invalid, due to padding of incorrect length (the size of a base64-encoded message must be divisible by 4).

After digging some more in rather complex and multitiered code, I gave up, and instead modified my test to transfer files of ever increasing size. To my surprise, I immediately saw the system in the madness: files of sizes 1-57 went through just fine, then sizes 58-114 resulted in an error, then it was fine again, then 172-228 again defined a problematic range (so 57 seemed to be a magical number here). At this point I was having a rather strong sense of deja vu. Both the client encoder and the server decoder had a decent suite of unit tests, and they were all green. Modifying the inputs in both to fall in the problematic size ranges didn't change anything. Out of sheer desperation, I copied the output of the encoder into the input of the decoder's tests verbatim (the only difference I could see from a manually crafted input was that the encoder's output was broken into multiple lines). Voila! The test went red.

From there it was easy to spot the problem: the decoder, when performing the padding verification, did not discard the newline characters, but looked at the total size of the encoded data. That meant that whenever the number of line breaks was odd (i.e., the number of lines was even), the newline characters (two per break: CR+LF) left the size of the message equal to 2 modulo 4. The longest line in the encoded message is 76 characters, corresponding exactly to the 57 bytes of the original message I was observing (57 / 3 × 4 = 76).
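A minimal sketch of the two checks, reconstructed from the description above (the names are mine, not the original code): the buggy version counts the CR/LF line separators toward the length, so any body with an odd number of line breaks fails the modulo-4 test.

    #include <cstddef>
    #include <string>

    // Buggy: validates the raw body length, CR/LF included. A 58-byte file
    // encodes to 80 base64 characters wrapped into two lines (76 + 4), so the
    // single CRLF makes the total 82, and 82 % 4 == 2 -> "malformed".
    bool padding_ok_buggy(const std::string& body) {
        return body.size() % 4 == 0;
    }

    // Fixed: discard the line breaks before checking the padding.
    bool padding_ok_fixed(const std::string& body) {
        std::size_t n = 0;
        for (char c : body)
            if (c != '\r' && c != '\n')
                ++n;
        return n % 4 == 0;
    }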
Saturday, March 12th, 2011
7:09 pm
[dimrub]
Out of "space"
This is not really a gotcha, rather a bug I solved a long time ago. I was recently reminded of it, so I decided to put it up here, to make it easy to find and refer to (I described it once before in my Russian-language blog, so I apologize if you've read about it already).

I was working on an embedded system (VxWorks) at the time. It's sort of a single-address-space, single-app, multiple-threads (aka tasks) thing, where any sufficiently severe bug brings down the whole system. I was responsible, among other things, for the management interfaces, including a terminal application that gave access (and control) to the internal state of the device. The bug report, as assigned to me, said that listing a certain table (one I had recently added) caused the device to crash (in free(), of all places, with a stack dump pointing to the terminal library). I tried reproducing it for hours, using exactly the same configuration as the QA and the exact described steps (which were quite simple: run the command that shows the table, scroll to the end of the table, do it a few times, watch the device crash and burn in spectacular flames), but in vain. Finally, I gave up, went upstairs to the QA, and watched closely how the QA engineer successfully reproduced the problem time after time. Eventually I noticed that his pattern of reproduction looked as follows:

- bring up the last command from history (arrow up)
- hit enter
- hit space many times, to scroll through the table
- repeat until the device has crashed

He performed these actions in quick succession. Usually he would hit the space bar many more times than necessary to scroll through the table. And occasionally he would brush the up arrow without actually hitting it, so he'd be hitting enter (submitting a command line) on a line full of spaces. That's when the crash would occur. It was easy to establish at this point that a string of 16 to 31 spaces caused the crash, and that the particular table had nothing to do with it.

I then deep-dived into the terminal library code and found there a function that removes trailing whitespace from a command line. This function would start from the end of the line and go back, replacing whitespace with null characters until it encountered a non-whitespace character, NOT CHECKING WHETHER IT HAD REACHED THE BEGINNING OF THE STRING! It so happens that an allocated memory block in this system is preceded by a one-byte prefix, in which a value of 0 means "the block is free", and any other value means the block is allocated and indicates its size. Memory is allocated in blocks whose sizes are powers of 2. If the string is 17 to 32 characters long (including the terminating null), the value of the prefix will be 32, which happens to be the ASCII code of the space character. So the function would happily overwrite the prefix with 0, upon seeing which (much later) free() would abort().
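A minimal reconstruction of the buggy trimmer (my sketch, not the original code):

    #include <cstring>

    // Buggy: walks backwards past the start of the buffer when the whole
    // line is spaces. With the allocator's one-byte size prefix sitting just
    // before the buffer, a prefix value of 32 (a 32-byte block) reads as a
    // space character (ASCII 32) and gets overwritten with 0 -- which marks
    // the block "free" and makes free() abort() much later.
    void rtrim_buggy(char* line) {
        char* p = line + std::strlen(line) - 1;
        while (*p == ' ' || *p == '\t')
            *p-- = '\0';
    }

    // Fixed: stop at the beginning of the string.
    void rtrim_fixed(char* line) {
        std::size_t len = std::strlen(line);
        while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t'))
            line[--len] = '\0';
    }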
Thursday, September 17th, 2009
3:02 pm
[dimrub]
TCP timestamp revisited
So, it's been quite a while since I started having problems accessing google.com and other Google sites from my home laptop, running Kubuntu 9.04. A WinXP station connected via the same router had no problems whatsoever. Finally I'd had enough of it, and I took a look at a typical failed connection with Wireshark. And, sure enough, what I saw was that the TCP timestamp option was being used (or misused, so it seems). I turned it off, and everything has worked just fine since then.
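For reference, the knob in question is the same sysctl that appears as the solution in the older TCP timestamp entry further down this page:

    echo 0 > /proc/sys/net/ipv4/tcp_timestamps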
Tuesday, July 21st, 2009
10:21 am
[dimrub]
Using SSL session IDs for authentication
A coworker reminded me of a problem I solved a while ago, so I decided to mention it here just to save myself solving it all over again a few years from now.

The SSL session ID is a mechanism in SSL/TLS that makes it possible to skip the SSL handshake (the heaviest part of SSL, performance-wise) on repeated connections. It works as follows: when a client first connects to an SSL server, the server issues, as part of the handshake, a random session ID of up to 32 bytes. When the client tries to reconnect to the same server, it sends that session ID as part of the Client Hello packet. If the server recognizes the session ID, it skips the handshake and uses the key agreed upon during the previous handshake. Servers keep session IDs, depending on configuration, for a number of hours or even days.
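On the client side, the mechanics look roughly like this with the OpenSSL API (a sketch assuming OpenSSL 1.1+; error handling omitted):

    #include <netdb.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <openssl/ssl.h>

    // Plain TCP connect helper.
    static int connect_tcp(const char* host, const char* port) {
        addrinfo hints = {}, *res;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, port, &hints, &res) != 0) return -1;
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;
    }

    int main() {
        SSL_CTX* ctx = SSL_CTX_new(TLS_client_method());

        // First connection: full handshake; the server hands out a session ID.
        SSL* ssl1 = SSL_new(ctx);
        SSL_set_fd(ssl1, connect_tcp("example.com", "443"));
        SSL_connect(ssl1);
        SSL_SESSION* sess = SSL_get1_session(ssl1);  // keep the session
        SSL_shutdown(ssl1);
        SSL_free(ssl1);

        // Second connection: offer the cached session in the Client Hello.
        SSL* ssl2 = SSL_new(ctx);
        SSL_set_session(ssl2, sess);
        SSL_set_fd(ssl2, connect_tcp("example.com", "443"));
        SSL_connect(ssl2);
        // True if the server remembered the ID and skipped the key exchange.
        int resumed = SSL_session_reused(ssl2);
        SSL_SESSION_free(sess);
        SSL_shutdown(ssl2);
        SSL_free(ssl2);
        SSL_CTX_free(ctx);
        return resumed ? 0 : 1;
    }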

There are sites out there that use this optimization feature as a way to authenticate their clients. It works as follows: when the client first arrives (without a session ID), it is forwarded to an authentication application, then back to the main site. From that point on, if the client reconnects without presenting the session ID given to it during authentication, it is forwarded to the authentication application again, and so on. This is broken in many ways, but there are apps out there that work this way. Proxy and client developers should be aware of it.
Sunday, May 31st, 2009
6:19 pm
[dimrub]
Browser refuses to open an SSL session through a proxy
Problem: When browsing through a proxy, the browser (IE, both versions 6 and 7) refuses to access HTTPS sites. No compelling explanation is provided (the generic "Internet Explorer cannot display the webpage" message is shown). Firefox allows surfing once a security exception has been acknowledged, but shows only the server's certificate, although the signing certificate should also be present.

Analysis: The proxy acts as a man in the middle, intercepting the CONNECT requests, performing the handshake against the server while acting as a client, then re-signing the server's certificate with its own signing certificate and using the new certificate to perform the handshake against the client, acting on behalf of (and disguised as) the server. For this to work, the signing certificate installed on the proxy, or the certificate of its issuer, should be recognized as trusted by the browser. In fact, the procedure is as follows:
1. Create the CSR on the proxy
2. Sign the CSR on the CA of the enterprise
3. Import the new signing certificate back into the proxy
Now the browser should consider the new "fake" certificate trusted, but it doesn't just yet: all of the signing certificates must have the basic constraints X509v3 extension defined with the value CA=true. This should be done during either stage 1 or 2 above. To make it so for a CSR created through openssl, the following should be added to the openssl.cnf file:

[ req ]
...
req_extensions = v3_req
...

[ v3_req ]
basicConstraints = CA:true

Apparently, the CA will sometimes override the settings defined in the CSR, so one has to make sure that the resulting certificate indeed defines this extension, e.g. by running

openssl x509 -in cert.pem -text -noout

and looking for the following lines:

X509v3 Basic Constraints: critical
CA:TRUE
Tuesday, March 10th, 2009
2:13 pm
[dimrub]
Wrong certificate in an SSL session
Problem: A customer has been complaining that once in a while, while surfing over HTTPS through our proxy (which acts as a man in the middle of sorts for the sake of SSL handling), he gets the wrong certificate for some of the sessions. E.g., he tries to browse to a.com, but gets a certificate for b.com instead.

Analysis: Studying the traffic captures reveals that in the offending session no Certificate record is found; instead, session reuse was employed. Comparison with the other session, directed to the other site (b.com), further revealed that the same session ID was used for the two sessions, so the second session was using the certificate cached for the first one, hence the confusion.

Failed workaround: trying to disable session caching failed. The code that decides whether the session caching on the client side will be used looks as follows:

    if ( cache_sessions )
    {
        SSL_CTX_set_session_cache_mode( m_clientsContextPtr->context(), 
                                        SSL_SESS_CACHE_SERVER );
    }


This code assumes that if caching is disabled, SSL_CTX_set_session_cache_mode will not be called and, hence, session caching will not be used. This is wrong: session caching is ON by default, so if we want it not to be used, we have to say so explicitly.
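A sketch of the corrected logic, making both branches explicit (SSL_SESS_CACHE_OFF is the documented OpenSSL flag for disabling the cache):

    // Session caching defaults to SSL_SESS_CACHE_SERVER, so the "off" case
    // must be stated explicitly rather than implied by not calling the setter.
    SSL_CTX_set_session_cache_mode( m_clientsContextPtr->context(),
                                    cache_sessions ? SSL_SESS_CACHE_SERVER
                                                   : SSL_SESS_CACHE_OFF );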

Further analysis: The clash of session IDs seemed to be related to Debian's bug of fame, in which OpenSSL's PRNG was effectively reduced to a coin flip. We expected that an upgrade of the corresponding packages (that is, an upgrade to a version of our software that contains the replacement OpenSSL) would solve the problem of collisions. However, the version installed at this customer's already contains the up-to-date OpenSSL, so that's not it. I'll be banging my head against the wall some more on this one.

P.S.: here's a way to make sure the proxy generates unique Session IDs.
1. Download and apply this patch for OpenSSL (it must be slightly modified to fit the current version of OpenSSL).
2. Run the following command line:
for (( i=0; i < 10000; i+=1 )); do echo "" | ./openssl s_client -connect server:443 -proxy proxy:8443 2>&1 | grep 'Session-ID:' | sed 's/^.*: //' >> ids; done
3. The file ids now contains the session IDs of 10000 sessions. It can now be checked for repetitions:
sort ids | uniq -d
Friday, March 6th, 2009
12:23 am
[dimrub]
Multiple DUP ACKs on a perfectly normal connection
Problem: One of the customers complains that trying to access a certain site through our system (which is an HTTP proxy of sorts) takes quite a lot of time.

Analysis: Inspection of the traffic from the proxy to the server reveals a large number of DUP ACKs on some of the sessions to the site in question (henceforth ba.com). The beginning of the session appears normal, but at some point the packets arriving from the server, though still apparently normal, are not accepted by our stack, judging by the fact that a DUP ACK is issued immediately after each of those packets.

Further analysis reveals that the Timestamp option is being used in these sessions, and that at some point the server sends a TSval of 0. We send back a TSecr of 0, according to the RFC, but then the server sends a normal TSval again, which our stack rejects (and the session is as good as dead at this point).

Solution: echo 0 > /proc/sys/net/ipv4/tcp_timestamps
Wednesday, March 4th, 2009
2:39 pm
[dimrub]
Recursive initialization of a local static variable with gcc4.3
Problem: a process (that, among other things, contains some singletons implemented by using static local variables) gets stuck on startup.

Analysis: Initially, the singletons were implemented via statics defined in the global scope. This was dangerous, because the order of initialization of globals is undefined across different compilation units; in fact, the compilation unit that contained the definition of the static HAD to be specified last in the list of libraries, so that its globals would be initialized first. This ordering is hard to enforce, and there is no guarantee it will keep holding in the future (not to mention possible other problems of interaction with other such globals).

The solution proposed, for example, by Meyers is to move the global statics into a local scope. Such statics are initialized the first time the function that contains them is called. This makes sense for a singleton, whose instance is always accessed through the instance() method.

In this particular instance, however, the aforementioned change caused the process to get stuck during startup. It was always stuck in something called __cxa_guard_acquire(). This function, it turns out, is part of the C++ ABI (Application Binary Interface), and is what g++ uses to ensure some degree of thread safety during the initialization of local statics. Apparently, this guard is non-reentrant. There was a code flow during the initialization of the aforementioned static that caused the same static to be used, thus re-entering the guard. The problem was solved by breaking this cyclic dependency.
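A minimal reproduction of such a cycle (my sketch, not the original code); depending on the libstdc++ version, this either hangs in __cxa_guard_acquire() or aborts with a recursive-initialization error:

    struct A;
    struct B;
    A& getA();
    B& getB();

    struct A { A() { getB(); } };  // initializing A drags in B...
    struct B { B() { getA(); } };  // ...and initializing B re-enters getA()

    // g++ wraps each local static's construction in
    // __cxa_guard_acquire()/__cxa_guard_release(); the guard is not reentrant.
    A& getA() { static A a; return a; }
    B& getB() { static B b; return b; }

    int main() {
        getA();  // getA -> A() -> getB -> B() -> getA: stuck in the guard
    }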
Wednesday, November 5th, 2008
4:46 pm
[dimrub]
RTSP over HTTP over a proxy
Problem: Tunneling of RTSP stream over HTTP doesn't work if there is a proxy in the way

Analysis: The uber-protocol that the Real client and server use to communicate between themselves is a marvel of engineering, on par with the tower of Pisa and Lenin's mausoleum. It works as follows:

1. The client tries to connect to the following ports: 80, 8080, 7070, 554. For each port it works with, it tries to work in something called "single post" mode, meaning:
2. It connects to the server (via proxy) on connection #1 and issues a GET request for the media (containing a unique string called a GUID).
3. It connects to the server (via proxy) on connection #2 and issues a POST request of a specific type (containing the same GUID as the GET request above). The Content-Length of this request is 32767 (this way they hope to be able to send lots of data over it without the proxy closing the connection). See the sketch below this list.
4. If the server sees a POST request within one second of a GET request with the same GUID, it starts to send the media over connection #1, while using connection #2 as a control channel (receiving commands from the client there).
5. Otherwise, the GET response contains a code that tells the client that the server did not receive a POST request within the specified time.
6. Theoretically, after #5 occurs, the client is supposed to switch to something called "multi-post" mode, in which a separate POST request with a correct Content-Length is issued for every command the client wishes to send to the server, but in practice I was unable to make the client switch into this mode. In fact, I've noticed the following disturbing fact: if the proxy port is one of two values (3128, 8080), then no multi-post mode is used. If, on the other hand, the port is any other port, it is used (even though it fails immediately).

Solution: Either make your proxy allow partial POST requests through, or just give up on tunneling RTSP.

P.S. Apparently, the same thing (POST requests with an unimaginable Content-Length) is characteristic of the WebEx™ ActiveX as well.
Thursday, October 23rd, 2008
7:58 am
[dimrub]
Linux ignores SYN packets
Problem: Once in a while, a Linux server will not accept connections. A capture shows 3 SYN packets coming from the same client (one; after 3 seconds, another; after another 6 seconds, a third), all unanswered. Reported several times on CentOS, but to me it actually happened on Debian.

Analysis: In our case, iptables was installed, with a rule in the INPUT chain that was supposed to DROP packets in the INVALID state. For some reason, it was dropping these perfectly valid SYNs.
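A rule of that shape would typically look like this (reconstructed; the exact ruleset varied):

    iptables -A INPUT -m state --state INVALID -j DROP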

Solution: Removing the rule solved the problem.
Tuesday, October 21st, 2008
12:24 pm
[dimrub]
WCCP in mask mode - constant source ports
Problem: A WCCP router stops monitoring a service after the client has been restarted

Analysis: Certain Cisco routers remember the UDP source ports and answer back to the ports they remember, instead of answering on the port the datagram came from. There are reasons to believe Cisco only uses source port 2048 in their own client (and performs service demultiplexing by looking at the content of the packet itself).

Solution: Access the router from a constant source port. Take care with multiple routers and multiple services; perhaps the source port should be constant for every combination of the two.
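A sketch of the workaround, assuming a POSIX socket API (2048 is the UDP port WCCP itself uses, so it's a natural constant to pin):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    // Bind the UDP socket to a fixed source port instead of letting the
    // kernel pick an ephemeral one, so the router's replies keep reaching
    // us across client restarts.
    int open_wccp_socket() {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        sockaddr_in local = {};
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons(2048);  // constant source port
        int on = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
        bind(fd, reinterpret_cast<sockaddr*>(&local), sizeof local);
        return fd;
    }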
Sunday, October 19th, 2008
5:07 pm
[dimrub]
ActiveX not loaded over SSL in IE if caching is not allowed
Problem: An HTML page has an embedded ActiveX control that does not start up in IE if the page is loaded over HTTPS.

Analysis: To make sure the problem you've encountered is the same one described here, load the page in IE using Fiddler (or any other application that lets you see the HTTP headers of the response), and look for the following headers:

Pragma: no-cache
Cache-Control: no-cache

If either of those appears, you've encountered a known issue in IE: it can't load an ActiveX control that is delivered over SSL and marked uncacheable.

Solution: If you have access to the server on which the ActiveX is hosted, make sure that the above headers are not there when the ActiveX is accessed via HTTPS.

Links: http://support.microsoft.com/kb/316431 (this article talks about embedded documents, but the issue is the same with embedded ActiveX).